Abstract. In this work we introduce a new linear time compression algorithm, called "Re-pair for Trees", which
compresses ranked ordered trees using linear straight-line context-free tree grammars. Such grammars
generalize straight-line context-free string grammars and allow basic tree operations, like traversal along edges, to be
executed without prior decompression. Our algorithm can be considered as a generalization of the "Re-pair"
algorithm developed by N. Jesper Larsson and Alistair Moffat in 2000. The latter algorithm is a dictionary-based
compression algorithm for strings. We also introduce a succinct coding which is specialized in further
compressing the grammars generated by our algorithm. This is accomplished without losing the ability to directly execute
queries on this compressed representation of the input tree. Finally, we compare the grammars and output files
generated by a prototype of the Re-pair for Trees algorithm with those of similar compression algorithms. The
obtained results show that our algorithm outperforms its competitors in terms of compression ratio, runtime
and memory usage.

1 Introduction

1.1 Motivation

Trees are nowadays a common data structure used in computer science to represent data
hierarchically. This is, for instance, evidenced by XML documents which are widely used after their
introduction in 1996. They are sequential representations of ordered unranked trees. When
processing trees it is often convenient to hold the tree structure in memory in order to retain fast and
random access to its nodes. However, this often leads to a heavy resource consumption in terms
of memory usage due to the necessary pointer structure which represents the tree structure. The
space needed to load an entire XML document into main memory in order to access it through a
DOM proxy is usually 3-8 times larger than the size of the document itself [WLH07]. Therefore,
it is essential for very large tree structures to use a memory efficient representation.
In [FGK03,BGK03] directed acyclic graphs (DAGs) were proposed to overcome this
problem. By sharing common subtrees one is able to reduce the size of the in-memory representation
by a factor of about 10 [BGK03]. One of the most appealing properties of this representation
is that queries like the ones of the XPath language can be directly executed on the compressed
representation, i.e., it is not necessary to completely unfold the DAG.

Later, in [BLM08] so-called linear straight-line context-free tree grammars were proposed
as a more succinct representation of an input tree. These grammars represent exactly one tree
and generalize the concept of sharing common subtrees to the sharing of repeating tree patterns.
Most importantly, this new representation is still queryable, i.e., queries can be evaluated without
prior decompression. At the same time, the complexity of querying, e.g., using XQuery, stays the
same as for DAGs [LM06].
Source message: a b c d a b c

After the replacement of (a, b): A c d A c

Fig. 1: The pair (a, b) is replaced by the new symbol A.

However, finding the smallest linear straight-line context-free tree grammar generating a
given tree is NP-hard. Already finding the smallest context-free string grammar for a given string 
is NP-complete [CLL+05]. In [BLM08] an algorithm called BPLEX was introduced which gen- 
erates a small linear straight-line context-free tree grammar for a given input tree. On average, 
the resulting grammar is 3.5-4 times smaller than the minimal DAG (in terms of the number of 
edges). An implementation of this algorithm, which processes the underlying tree structure of 
XML documents, was also provided. 

1.2 Main Contribution 

Our main contribution is a compression algorithm, called "Re-pair for Trees", which is based 
on linear straight-line context-free tree grammars. Our investigations show that, regarding our 
test data, the grammars generated by Re-pair for Trees are always smaller than the grammars 
produced by the BPLEX algorithm. In addition, our algorithm outperforms BPLEX in terms of
runtime and memory usage. Note that runtime in particular was a major drawback of the BPLEX
implementation.

The Re-pair for Trees algorithm is a generalization of the "Re-pair" algorithm which was
developed by Larsson and Moffat in [LM00]. The latter algorithm is an offline dictionary-based
compression method for strings consisting of a simple but powerful phrase derivation method 
and a compact dictionary encoding. A dictionary-based compression algorithm is an algorithm 
where the input message is parsed into a sequence of phrases selected from a dictionary. Since 
the reference to a phrase in the dictionary is more compact than the phrase itself often a consid- 
erable compression can be achieved. Re-pair's dictionary is inferred offline since it is generated 
by considering the whole input message and since it is written out as a part of the compressed 
data so that it is available to the decoder. The name Re-pair stands for "recursive pairing" and 
describes the idea of the algorithm. The latter is to count the frequencies of all pairs formed by 
two adjacent symbols of the source message, replacing the most frequent pair by a new symbol 
(see Fig. 1), updating the frequency counters of all involved pairs and repeating this process until 
there are no pairs occurring twice in the source message. This compression technique allows 
searching the compressed data without prior decompression. 
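
The recursive pairing idea described above can be illustrated with a minimal sketch. It is not the linear-time algorithm of Larsson and Moffat: it simply recounts all pairs in every round, and the rule names `R0`, `R1`, ... as well as the helper names `count_pairs`, `repair` and `expand` are assumptions made for this illustration only.

```python
from collections import Counter

def count_pairs(seq):
    """Count non-overlapping occurrences of each adjacent pair (left to right)."""
    counts = Counter()
    last_start = {}  # pair -> start index of its last counted occurrence
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        # In a run like "aaa" the occurrences at i-1 and i overlap; count one.
        if last_start.get(pair) == i - 1:
            continue
        last_start[pair] = i
        counts[pair] += 1
    return counts

def repair(message):
    """Replace the most frequent pair by a fresh symbol until no pair repeats."""
    seq = list(message)
    rules = {}  # fresh symbol -> the pair it stands for (the dictionary)
    while True:
        counts = count_pairs(seq)
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break
        name = f"R{len(rules)}"  # assumed not to occur in the input alphabet
        rules[name] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

def expand(seq, rules):
    """Decode a compressed sequence by unfolding the dictionary rules."""
    out = []
    for symbol in seq:
        if symbol in rules:
            out.extend(expand(rules[symbol], rules))
        else:
            out.append(symbol)
    return out
```

For the source message of Fig. 1, `repair("abcdabc")` yields a sequence of three symbols together with two rules, and `expand` restores the original message.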

1.3 Organization of this Work 

In Sect. 3 we explain in detail the two steps of which the Re-pair for Trees algorithm consists. 
We also present a complete example of a run of our algorithm and consider the compressibility 
of special types of trees depending on the maximal rank allowed for a nonterminal. Sect. 4 
explains some of the implementation details for the Re-pair for Trees algorithm which is called
TreeRePair. In particular, we elaborate on its linear runtime, the internal data structures used
and its efficient in-memory representation of the input tree. Moreover, in Sect. 5 we present a
succinct coding which is specialized in further compressing the grammars generated by the
Re-pair for Trees algorithm without losing the ability to directly execute queries on this compressed
representation of the input tree. By using a combination of multiple Huffman codings, a run- 
length coding and a fixed-length coding the resulting file sizes are always smaller than the sizes 
of the files generated by competing compression algorithms when executed on our test data. 
In Sect. 6 we compare the compression results of our implementation of the Re-pair for Trees 
algorithm with several other compression algorithms. In particular, we consider BPLEX and 
"Extended-Repair". The latter algorithm is also based on the Re-pair for strings algorithm and 
was independently developed at the University of Paderborn, Germany [Kri08,BHK10].

2 Preliminaries 

In the following, $\mathbb{N}_{>0} = \mathbb{N} \setminus \{0\}$ denotes the set of non-zero natural numbers. For a set $X$ we
denote by $X^*$ the set of all finite words over $X$. For $w = x_1 x_2 \cdots x_n \in X^*$ we define $|w| = n$.
The empty word is denoted by $\varepsilon$.

We sometimes surround an element of $\mathbb{N}$ by square brackets in order to emphasize that we
currently consider it a character instead of a number. For instance, for the sequence of integers
222221 we write $[2]^5[1]$ instead of $2^5 1$ to clarify that we are not dealing with the fifth
power of 2.

2.1 Labeled Ordered Tree 

A ranked alphabet is a tuple $(\mathcal{F}, \mathrm{rank})$, where $\mathcal{F}$ is a finite set of function symbols and the
function $\mathrm{rank} : \mathcal{F} \to \mathbb{N}$ assigns to each $a \in \mathcal{F}$ its rank. Furthermore, we define
$\mathcal{F}_i = \{a \in \mathcal{F} \mid \mathrm{rank}(a) = i\}$. We fix a ranked alphabet $(\mathcal{F}, \mathrm{rank})$ in the following. An $\mathcal{F}$-labeled ordered tree is
a pair $t = (\mathrm{dom}_t, \lambda_t)$, where

(1) $\mathrm{dom}_t \subseteq \mathbb{N}_{>0}^*$ is a finite set of nodes,

(2) $\lambda_t : \mathrm{dom}_t \to \mathcal{F}$,

(3) if $w = vv' \in \mathrm{dom}_t$, then also $v \in \mathrm{dom}_t$, and

(4) if $v \in \mathrm{dom}_t$ and $\lambda_t(v) \in \mathcal{F}_n$, then $vi \in \mathrm{dom}_t$ if and only if $1 \le i \le n$.

The node $\varepsilon \in \mathrm{dom}_t$ is called the root of $t$. By $\mathrm{index}(w)$, where $w = vi \in \mathrm{dom}_t \setminus \{\varepsilon\}$ and $i \in \mathbb{N}_{>0}$,
we denote the index $i$ of the node $w$, i.e., $w$ is the $i$-th child of its parent node. Furthermore, we
define $\mathrm{parent}(w) = v$. The size of $t$ is given by the number of edges of which it consists, i.e., we
have $|t| = |\mathrm{dom}_t| - 1$. The depth of the tree $t$ is $\mathrm{depth}(t) = \max\{|u| \mid u \in \mathrm{dom}_t\}$. We identify an
$\mathcal{F}$-labeled tree $t$ with a term in the usual way: if $\lambda_t(\varepsilon) = a \in \mathcal{F}_i$, then this term is $a(t_1, \ldots, t_i)$,
where $t_j$ is the term associated with the subtree of $t$ rooted at node $j$, where $j \in \{1, \ldots, i\}$. The
set of all $\mathcal{F}$-labeled trees is $T(\mathcal{F})$.
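
To make the correspondence between terms and the pair $(\mathrm{dom}_t, \lambda_t)$ concrete, the following sketch converts a term into a dictionary from node addresses to labels. The nested-tuple encoding `(label, child_1, ..., child_n)`, the use of Python tuples of integers as node addresses (with `()` for $\varepsilon$), and all function names are assumptions chosen for this illustration.

```python
def to_dom(term, node=()):
    """Return {node address: label} for a term (label, child_1, ..., child_n)."""
    label, *children = term
    dom = {node: label}
    for i, child in enumerate(children, start=1):
        dom.update(to_dom(child, node + (i,)))
    return dom

def index(w):
    """Index i of a non-root node w = vi among the children of its parent."""
    return w[-1]

def parent(w):
    """Parent v of a non-root node w = vi."""
    return w[:-1]

def size(dom):
    """Number of edges, |t| = |dom_t| - 1."""
    return len(dom) - 1

def depth(dom):
    """Length of the longest node address."""
    return max(len(u) for u in dom)
```

For the term `("f", ("g", ("a",)), ("a",))`, that is $f(g(a), a)$, the node set is $\{\varepsilon, 1, 11, 2\}$, the size is 3 and the depth is 2.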

Example 1. In Fig. 2 an $\mathcal{F}$-labeled ordered tree $t$ is shown. We have

$\mathrm{dom}_t = \{\varepsilon, 1, 2, 3, 11, 12, 21, 22, 31, 111, 112, 121, 122, 211, 212, 221, 222\}$.

Fig. 2: $\mathcal{F}$-labeled ordered tree $t$.

We fix a countable set $\mathcal{Y} = \{y_1, y_2, \ldots\}$ with $\mathcal{Y} \cap \mathcal{F} = \emptyset$ of (formal context-) parameters (below
we also use a distinguished parameter $z \notin \mathcal{Y}$). The set of all $\mathcal{F}$-labeled trees with parameters
from $Y \subseteq \mathcal{Y}$ is denoted by $T(\mathcal{F}, Y)$. Formally, we consider parameters as function symbols of
rank 0 and define $T(\mathcal{F}, Y) = T(\mathcal{F} \cup Y)$. The tree $t \in T(\mathcal{F}, Y)$ is said to be linear if every
parameter $y \in Y$ occurs at most once in $t$. By $t[y_1/t_1, \ldots, y_n/t_n]$ we denote the tree that is
obtained by replacing in $t$ for every $i \in \{1, 2, \ldots, n\}$ every $y_i$-labeled leaf with $t_i$, where
$t \in T(\mathcal{F}, \{y_1, \ldots, y_n\})$ and $t_1, \ldots, t_n \in T(\mathcal{F}, \mathcal{Y})$. A context is a tree $C \in T(\mathcal{F}, \mathcal{Y} \cup \{z\})$ in which
the distinguished parameter $z$ appears exactly once. Instead of $C[z/t]$ we write briefly $C[t]$. Let
$t = (\mathrm{dom}_t, \lambda_t) \in T(\mathcal{F}, \{y_1, \ldots, y_n\})$ such that for every $y_i$ there exists a node $v \in \mathrm{dom}_t$ with
$\lambda_t(v) = y_i$. We say that $t$ is a tree pattern occurring in $t' \in T(\mathcal{F}, \mathcal{Y})$ if there exist a context
$C \in T(\mathcal{F}, \mathcal{Y} \cup \{z\})$ and trees $t_1, \ldots, t_n \in T(\mathcal{F}, \mathcal{Y})$ such that

$C[t[y_1/t_1, y_2/t_2, \ldots, y_n/t_n]] = t'$.

2.2 SLCF Tree Grammar 

For further consideration, let us fix a countably infinite set $\mathcal{N}_i$ of symbols of rank $i \in \mathbb{N}$ with
$\mathcal{F}_i \cap \mathcal{N}_i = \emptyset$ and $\mathcal{Y} \cap \mathcal{N}_0 = \emptyset$. Hence, every finite subset $N \subseteq \bigcup_{i \ge 0} \mathcal{N}_i$ is a ranked alphabet.
A context-free tree grammar (over the ranked alphabet $\mathcal{F}$), or CF tree grammar for short, is a triple
$\mathcal{G} = (N, P, S)$, where

(1) $N \subseteq \bigcup_{i \ge 0} \mathcal{N}_i$ is a finite set of nonterminals,

(2) $P$ (the set of productions) is a finite set of pairs $(A \to t)$, where $A \in N$,
$t \in T(\mathcal{F} \cup N, \{y_1, \ldots, y_{\mathrm{rank}(A)}\})$, $t \notin \mathcal{Y}$, each of the parameters $y_1, \ldots, y_{\mathrm{rank}(A)}$ appears in $t$ (see footnote), and

(3) $S \in N$ is the start nonterminal of rank 0.

We assume that every nonterminal $B \in N \setminus \{S\}$ as well as every terminal symbol from $\mathcal{F}$ occurs
in the right-hand side $t$ of some production $(A \to t) \in P$.

Let us define the derivation relation $\Rightarrow_{\mathcal{G}}$ on $T(\mathcal{F} \cup N, \mathcal{Y})$ as follows: $s \Rightarrow_{\mathcal{G}} s'$ iff there exist
a production $(A \to t) \in P$ with $\mathrm{rank}(A) = n$, a context $C \in T(\mathcal{F} \cup N, \mathcal{Y} \cup \{z\})$, and trees
$t_1, \ldots, t_n \in T(\mathcal{F} \cup N, \mathcal{Y})$ such that $s = C[A(t_1, \ldots, t_n)]$ and $s' = C[t[y_1/t_1, \ldots, y_n/t_n]]$. Let

$L(\mathcal{G}) = \{t \in T(\mathcal{F}) \mid S \Rightarrow_{\mathcal{G}}^* t\} \subseteq T(\mathcal{F})$.

Footnote: In contrast to [LMSS09], our definition of a context-free tree grammar inherits productivity, i.e., $t \notin \mathcal{Y}$ and each parameter
$y_1, \ldots, y_{\mathrm{rank}(A)}$ appears in $t$ for every $(A \to t) \in P$. This is justified by the fact that the grammars generated by the Re-pair
for Trees algorithm are always productive.



The size $|\mathcal{G}|$ of the CF tree grammar $\mathcal{G}$ is defined by

$|\mathcal{G}| = \sum_{(A \to t) \in P} |t|$.

That means that $|\mathcal{G}|$ equals the sum of the numbers of edges of the right-hand sides of $P$'s
productions. We consider the following restrictions on context-free tree grammars:

- $\mathcal{G}$ is $k$-bounded (for $k \in \mathbb{N}$) if $\mathrm{rank}(A) \le k$ for every $A \in N$.

- $\mathcal{G}$ is monadic if it is 1-bounded.

- $\mathcal{G}$ is linear if for every $(A \to t) \in P$ the term $t$ is linear.

Let $\mathcal{G} = (N, P, S)$ be a CF tree grammar. We denote the set of all nodes in the right-hand sides
of $\mathcal{G}$'s productions which are labeled by the nonterminal $A \in N$ by $\mathrm{ref}_{\mathcal{G}}(A)$, i.e.,

$\mathrm{ref}_{\mathcal{G}}(A) = \{(t, v) \mid \exists (B \to t) \in P : v \in \mathrm{dom}_t \wedge \lambda_t(v) = A\}$.

Furthermore, let us define the following relation:

$\leadsto_{\mathcal{G}} \; = \{(A, B) \in N \times N \mid (B \to t) \in P \wedge A \text{ occurs in } t\}$

A straight-line context-free tree grammar (SLCF tree grammar) is a CF tree grammar
$\mathcal{G} = (N, P, S)$, where

(1) for every $A \in N$ there is exactly one production $(A \to t) \in P$ with left-hand side $A$, and

(2) the relation $\leadsto_{\mathcal{G}}$ is acyclic.

The conditions (1) and (2) ensure that $L(\mathcal{G})$ contains exactly one tree, which we denote by $\mathrm{val}(\mathcal{G})$.
Let $\mathcal{G}$ be an SLCF tree grammar. We call the reflexive transitive closure of $\leadsto_{\mathcal{G}}$ the hierarchical
order of $\mathcal{G}$ and denote it by $\leadsto_{\mathcal{G}}^*$.

Example 2. Consider the (linear and monadic) SLCF tree grammar $\mathcal{G} = (N, P, S)$ given by the
following productions:

$S \to f(A(a), A(b), B)$
$A(y_1) \to g(i(a, a), i(a, y_1))$
$B \to h(a)$

We have $\mathrm{val}(\mathcal{G}) = t$, where $t \in T(\mathcal{F})$ is the tree from Example 1.
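
As a cross-check of Example 2, the following sketch unfolds an SLCF tree grammar into the tree it represents. The encoding is an assumption made for illustration: terms are nested tuples `(label, child_1, ..., child_n)`, parameters are leaves labeled `"y1"`, `"y2"`, ..., and a grammar is a dictionary mapping each nonterminal to a pair (parameter names, right-hand side).

```python
def val(term, productions, env=None):
    """Unfold a term over terminals, nonterminals and parameters into a tree."""
    env = env or {}
    label, *children = term
    if label in env:                      # parameter leaf: insert its argument
        return env[label]
    args = [val(c, productions, env) for c in children]
    if label in productions:              # nonterminal: unfold its production
        params, rhs = productions[label]
        return val(rhs, productions, dict(zip(params, args)))
    return (label, *args)                 # terminal symbol

def count_nodes(term):
    """Number of nodes of a term."""
    return 1 + sum(count_nodes(c) for c in term[1:])

a, b = ("a",), ("b",)
example_2 = {
    "S": ((), ("f", ("A", a), ("A", b), ("B",))),
    "A": (("y1",), ("g", ("i", a, a), ("i", a, ("y1",)))),
    "B": ((), ("h", a)),
}
tree = val(("S",), example_2)
```

The resulting tree is $f(g(i(a,a), i(a,a)), g(i(a,a), i(a,b)), h(a))$ and has 17 nodes, which agrees with the node set listed in Example 1. Note that unfolding can take exponential time in general, since the represented tree may be exponentially larger than the grammar.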

SLCF tree grammars can be considered as a generalization of the well-known DAGs (see, for 
instance, [LM06] for a common definition). Whereas the latter is a structure preserving compres- 
sion of a tree by sharing common subtrees (see Fig. 3.1 for a depiction), SLCF tree grammars 
broaden this concept to the sharing of repeated tree patterns in a tree (see Fig. 3.2). Actually, a 
DAG can be considered as a 0-bounded SLCF tree grammar. 

Let $\mathcal{G} = (N, P, S)$ be a linear SLCF tree grammar. We define the function

$\mathrm{sav}_{\mathcal{G}}(A) = |\mathrm{ref}_{\mathcal{G}}(A)| \cdot (|t| - \mathrm{rank}(A)) - |t|$,   (1)

which computes for every production $(A \to t) \in P$ its contribution to a small representation
of the tree $\mathrm{val}(\mathcal{G})$ by the linear SLCF tree grammar $\mathcal{G}$. The value $\mathrm{sav}_{\mathcal{G}}(A)$ specifies the number
of edges by which the production with left-hand side $A$ reduces the size of the grammar $\mathcal{G}$.
However, $\mathrm{sav}_{\mathcal{G}}$ is not restricted to positive values. In particular, for a production $(A \to t) \in P$
with $|\mathrm{ref}_{\mathcal{G}}(A)| = 1$ we have $\mathrm{sav}_{\mathcal{G}}(A) = -\mathrm{rank}(A)$. Thus, a production which is only referenced
once can be safely removed from the grammar without increasing the size of $\mathcal{G}$.

Fig. 3.1: A tree $t$ containing two occurrences of the very same subtree $t'$.

Fig. 3.2: A tree $t$ containing two occurrences of the tree pattern $p$.
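
Equation (1) can be evaluated mechanically. The sketch below does so for the grammar of Example 2; the nested-tuple term encoding and the representation of a grammar as a dictionary from nonterminals to (rank, right-hand side) are assumptions made for this illustration.

```python
def edges(term):
    """Size |t| of a term, i.e. its number of edges."""
    return sum(1 + edges(c) for c in term[1:])

def count_label(term, label):
    """Number of nodes of a term carrying the given label."""
    return (term[0] == label) + sum(count_label(c, label) for c in term[1:])

def sav(name, productions):
    """sav(A) = |ref(A)| * (|t| - rank(A)) - |t| for the production A -> t."""
    rank, rhs = productions[name]
    refs = sum(count_label(r, name) for _, r in productions.values())
    return refs * (edges(rhs) - rank) - edges(rhs)

a, b = ("a",), ("b",)
example_2 = {
    "S": (0, ("f", ("A", a), ("A", b), ("B",))),
    "A": (1, ("g", ("i", a, a), ("i", a, ("y1",)))),
    "B": (0, ("h", a)),
}
```

Here `sav("A", example_2)` is 2 · (6 − 1) − 6 = 4: the grammar has size 5 + 6 + 1 = 12, whereas inlining A gives a start production with 15 edges and a total size of 16. For B, which is referenced once and has rank 0, the value is 0 = −rank(B).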

Context-free tree grammars [CDG+07] and especially SLCF tree grammars have been
thoroughly studied recently. In theory, SLCF tree grammars can be exponentially more
succinct than DAGs [LM06], which already can achieve exponential compression ratios.
Furthermore, in [LM06] various membership and complexity problems were considered. It was shown
that in many cases the same complexity bounds hold as for DAGs. In particular, it was shown
that for a given nondeterministic tree automaton $\mathcal{A}$ and a linear, $k$-bounded SLCF tree grammar
$\mathcal{G}$ it can be checked in polynomial time if $\mathrm{val}(\mathcal{G})$ is accepted by $\mathcal{A}$, provided that $k$ is a constant.
This is a noteworthy result since in the context of XML, for instance, tree automata are
used to type check XML documents against an XML schema (cf. [MLMK05,Nev02]).
Moreover, this result was further improved in [LMSS09], where it was shown that every linear SLCF
tree grammar can be transformed in polynomial time into a monadic (and linear) one. Together
with the above mentioned result from [LM06], a polynomial time algorithm for testing if a given
nondeterministic tree automaton accepts a tree given by a linear SLCF tree grammar (of arbitrary
maximal rank for the nonterminals) can be obtained.

In [BLM08] the so-called BPLEX algorithm was presented. It produces for a given 0-bounded
SLCF tree grammar $\mathcal{G}_1$, i.e., $\mathcal{G}_1$ represents a DAG, in time $O(|\mathcal{G}_1|)$ an equivalent linear SLCF tree
grammar $\mathcal{G}_2$, where $\mathrm{val}(\mathcal{G}_2) = \mathrm{val}(\mathcal{G}_1)$ and $\mathcal{G}_2$ is $k$-bounded ($k$ is an input parameter).
Experiments have shown that $|\mathcal{G}_2|$ is approximately 2-3 times smaller than $|\mathcal{G}_1|$.

Moreover, in [LMSS09] it was proved that the evaluation problem for core XPath (the
navigational part of XPath) over SLCF tree grammars is PSPACE-complete, just as was proved
earlier for DAG-compressed trees by Frick, Grohe and Koch in [FGK03]. The evaluation
problem for XPath asks whether a given node in a given tree is selected by a given XPath expres- 
sion. This result is remarkable since with SLCF tree grammars one achieves better compression 
ratios than with DAGs. 
<books>
  <book>
    <author /><title /><isbn />
  </book>
  <book>
    <author /><title /><isbn />
  </book>
</books>

Fig. 3: A simplified XML document.



2.3 XML Terminology 

Regarding XML documents, we use the official terminology introduced in [BPSM+08]. Thus
an XML document contains one or more elements which are either delimited by start-tags and 
end-tags or by an empty-element tag. The text between the start-tag and the end-tag of an element 
is called the element's content. An element with no content is said to be empty. There is exactly 
one element, called root, which does not appear in the content of any other element. 

Example 3. The simplified XML document from Fig. 3 consists of 21 elements of the five types 
books, book, author, title and isbn. The elements of type books and book are de- 
limited by start- and end-tags and exhibit element content. The remaining elements are empty 
elements delimited by empty-element tags. The root of the XML document is the element of type
books.

The name in the start- and end-tags of an element gives the element's type. Elements can specify
attributes by using name-value pairs. Consider for instance the element

<phone prefix="012">3456</phone>

exhibiting one attribute specification with attribute name prefix and attribute value 012. 

In addition to these terms we denote by XML document tree the nested structure of elements 
which is left after removing all character data and attribute specifications from an XML docu- 
ment. 

2.4 Binary Tree Model 

An XML document tree can be considered as an unranked tree, i.e., nodes with the same label 
possibly have a varying number of children. Figure 4 shows the XML document tree of the XML 
document from Example 3. In our case, the XML document tree is a ranked tree, i. e., all nodes 
with the same label exhibit the same number of children. However, the XML document might 
as well have contained an element of type book exhibiting a second author child element. In
this case, we would not have obtained a ranked tree.

In the next section we will learn that our Re-pair for Trees algorithm operates on ranked trees 
only. Therefore, in general, a transformation of an XML document tree becomes necessary. A 
common way of modeling such a tree in a ranked way is to transform it into a binary $\mathcal{F}$-labeled
ordered tree $t$ by encoding first-child and next-sibling relations. In fact,

- the first child element of an XML element becomes the left child of the node representing its
parent element and

- the right sibling element of another element becomes the right child of the node representing
its left sibling (cf. Fig. 5).

Fig. 4: XML document tree of the XML document listed in Fig. 3.

Fig. 5: Binary tree representation of the XML document tree from Fig. 4.

Note that a node representing a leaf (resp. a last sibling) of the XML document has no left
(resp. no right) child in the binary tree model representation. Therefore $\mathcal{F}$ does not consist of the
element types of the XML document but of special versions of the element types indicating that
the left, the right, both or no children are missing. In Fig. 5 this is denoted by superscripts at the
end of the element types. These superscripts are listed in Table 1 together with their meanings.
Let us point out that another way of preserving the rankedness along with circumventing the
introduction of special labels with a lower rank is the introduction of placeholder nodes. These
can be used to indicate missing left or right children. However, our experiments showed that our
implementation of Re-pair for Trees achieves slightly less competitive compression results in
this setting.

Superscript   Meaning
00            no children
10            no right child
01            no left child
11            two children

Table 1: The superscripts and their meanings.
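
The first-child/next-sibling encoding with the superscripts of Table 1 can be sketched as follows. The nested-tuple encoding of the unranked input tree, the label format `type^xy` and the function name `encode` are assumptions for this illustration; the first superscript digit records whether a left child (first child element) exists, the second whether a right child (next sibling element) exists.

```python
def encode(node, siblings=()):
    """Binary encoding of an unranked tree (label, child_1, ..., child_n).

    `siblings` holds the right siblings of `node` in document order.
    """
    label, *children = node
    kids = []
    if children:   # first child element becomes the left child
        kids.append(encode(children[0], tuple(children[1:])))
    if siblings:   # next sibling element becomes the right child
        kids.append(encode(siblings[0], tuple(siblings[1:])))
    tag = f"{int(bool(children))}{int(bool(siblings))}"
    return (f"{label}^{tag}", *kids)

book = ("book", ("author",), ("title",), ("isbn",))
binary = encode(("books", book, book))
```

Every label now has a fixed number of children (two for `^11`, one for `^10` and `^01`, none for `^00`), so the result is a ranked tree. The sketch recurses once per sibling, so very long sibling lists would require an iterative formulation in practice.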
In [BLM08] it was stated that the binary tree model allows access to the next-in-preorder and
previous-in-preorder node in O(depth), where depth refers to the longest path from the root of
the XML document to one of its leaves. Furthermore, in [MSV03] it was demonstrated that XML 
query languages can be readily evaluated on the binary tree model. 

3 Re-Pair for Trees 

In this section we study the Re-pair for Trees algorithm in detail. It consists of two steps, namely, 
a replacement step and a pruning step. Furthermore, a detailed example of a run of our algorithm
is presented. Finally, we investigate the impact of a possible restriction on the maximal rank 
allowed for nonterminals. 

3.1 Digrams 

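In order to be able to elaborate on our Re-pair for Trees algorithm we need the following
definitions. Recall that we have fixed a ranked alphabet $\mathcal{F}$ of function symbols, a set $\mathcal{N}$ of nonterminals
and a set $\mathcal{Y}$ of parameters. We define the set of triples

$\Pi = \bigcup_{a \in \mathcal{F} \cup \mathcal{N}} \{a\} \times \{1, 2, \ldots, \mathrm{rank}(a)\} \times (\mathcal{F} \cup \mathcal{N})$.

A digram is a triple $\alpha = (a, i, b) \in \Pi$. The symbol $a$ is called the parent symbol of the digram
$\alpha$ and $b$ is called the child symbol of the digram $\alpha$, respectively. We define

$\mathrm{par}(\alpha) = \mathrm{rank}(a) + \mathrm{rank}(b) - 1$ and

$\mathrm{pat}(\alpha) = a(y_1, \ldots, y_{i-1}, b(y_i, \ldots, y_{j-1}), y_j, \ldots, y_{\mathrm{par}(\alpha)})$,

where $j = i + \mathrm{rank}(b)$ and $y_1, y_2, \ldots, y_{\mathrm{par}(\alpha)} \in \mathcal{Y}$. Let $m \in \mathbb{N} \cup \{\infty\}$. We further define the set

$\Pi_m = \{\alpha \in \Pi \mid \mathrm{par}(\alpha) \le m\}$.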

Obviously, it holds that $\Pi_\infty = \Pi$. We can consider $\mathrm{pat}(\alpha)$ as the tree pattern which is
represented by the digram $\alpha$. We usually denote digrams by possibly indexed lowercase letters
$\alpha, \alpha_1, \alpha_2, \ldots, \beta, \ldots$ of the Greek alphabet. An occurrence of the digram $\alpha \in \Pi$ within the tree
$t = (\mathrm{dom}_t, \lambda_t) \in T(\mathcal{F} \cup \mathcal{N}, \mathcal{Y})$ is a node $v \in \mathrm{dom}_t$ at which a subtree

$\mathrm{pat}(\alpha)[y_1/t_1, y_2/t_2, \ldots, y_{\mathrm{par}(\alpha)}/t_{\mathrm{par}(\alpha)}]$,

with $t_1, t_2, \ldots, t_{\mathrm{par}(\alpha)} \in T(\mathcal{F} \cup \mathcal{N}, \mathcal{Y})$, is rooted. The set of all occurrences of the digram $\alpha$ in $t$
is denoted by $\mathrm{OCC}_t(\alpha) \subseteq \mathrm{dom}_t$.

Let $\alpha = (a, i, b) \in \Pi$ and $t \in T(\mathcal{F} \cup \mathcal{N}, \mathcal{Y})$. Two occurrences $v, w \in \mathrm{OCC}_t(\alpha)$ are
overlapping if one of the following equations holds: $v = w$, $vi = w$ or $wi = v$. Otherwise, i.e., if $v$ and
$w$ are not overlapping, $v$ and $w$ are said to be non-overlapping. A subset $\sigma \subseteq \mathrm{OCC}_t(\alpha)$ is said to
be overlapping if there exist overlapping $v, w \in \sigma$ with $v \neq w$, otherwise it is called non-overlapping. It is
easy to see that the set $\mathrm{OCC}_t(\alpha)$ is non-overlapping if $a \neq b$. In contrast, if we have $a = b$, the
set $\mathrm{OCC}_t(\alpha)$ potentially contains overlapping occurrences. Consider the following example:
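
The definitions of $\mathrm{par}$ and $\mathrm{pat}$ can be written down directly. In the sketch below, the dictionary `rank` from symbols to ranks and the nested-tuple term encoding with parameter leaves `("y1",)`, `("y2",)`, ... are assumptions made for illustration.

```python
def par(alpha, rank):
    """Number of parameters of the pattern represented by alpha = (a, i, b)."""
    a, i, b = alpha
    return rank[a] + rank[b] - 1

def pat(alpha, rank):
    """Tree pattern a(y_1..y_{i-1}, b(y_i..y_{j-1}), y_j..y_par), j = i + rank(b)."""
    a, i, b = alpha
    j = i + rank[b]
    before = [(f"y{k}",) for k in range(1, i)]
    inner = (b, *((f"y{k}",) for k in range(i, j)))
    after = [(f"y{k}",) for k in range(j, par(alpha, rank) + 1)]
    return (a, *before, inner, *after)

rank = {"f": 2, "g": 1, "a": 0}
```

For instance, `pat(("f", 2, "f"), rank)` is the term $f(y_1, f(y_2, y_3))$ with three parameters, and `pat(("f", 1, "a"), rank)` is $f(a, y_1)$.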




Fig. 6: Tree $t \in T(\mathcal{F})$ consisting of nodes labeled by the terminal $f \in \mathcal{F}_2$ and the subtrees $t_1, t_2, \ldots, t_5 \in T(\mathcal{F})$. We have to
deal with overlapping occurrences of the digram $(f, 2, f)$.

Example 4. Let $t \in T(\mathcal{F})$ be the tree depicted in Fig. 6 and let $\alpha = (f, 2, f)$. Hence, $\{\varepsilon, 2, 22\} \subseteq
\mathrm{OCC}_t(\alpha)$, where on the one hand $\varepsilon$ and 2 and on the other hand 2 and 22 are overlapping
occurrences of $\alpha$.

Let $\alpha \in \Pi$ and $t \in T(\mathcal{F} \cup \mathcal{N}, \mathcal{Y})$. Let $\sigma \subseteq \mathrm{OCC}_t(\alpha)$ be a non-overlapping set. Furthermore, let
us assume that $\sigma \cup \{v\}$ is overlapping for all $v \in \mathrm{OCC}_t(\alpha) \setminus \sigma$, i.e., $\sigma$ is maximal with respect
to inclusion among non-overlapping subsets. Then $\sigma$ is not necessarily maximal with respect to
cardinality.

Example 5. Consider the tree $t \in T(\mathcal{F})$ which is depicted in Fig. 6. Let $\alpha = (f, 2, f) \in \Pi$. We
have $\mathrm{OCC}_t(\alpha) = \{\varepsilon, 2, 22\}$. Let $\sigma = \{2\} \subseteq \mathrm{OCC}_t(\alpha)$. The set $\sigma$ is non-overlapping and $\sigma \cup \{v\}$
is overlapping for all $v \in \mathrm{OCC}_t(\alpha) \setminus \sigma$. However, $\sigma$ is not maximal with respect to cardinality.
Consider the non-overlapping subset $\sigma' = \{\varepsilon, 22\} \subseteq \mathrm{OCC}_t(\alpha)$. We have $|\sigma| < |\sigma'|$.

Example 5 shows us that we cannot choose an arbitrary subset $\sigma \subseteq \mathrm{OCC}_t(\alpha)$ which is
non-overlapping and maximal with respect to inclusion to obtain a set which is maximal with respect
to cardinality. Let us also point out that the set $\mathrm{OCC}_t(\alpha)$ may contain more than one maximal
(with respect to cardinality) non-overlapping subset.

Example 6. Consider the tree $f(f(f(a)))$ over the ranked alphabet $\mathcal{F}$. The sets $\{\varepsilon\}$ and $\{1\}$ are
both maximal with respect to cardinality.

The algorithm retrieve-occurrences($t$, $\alpha$) from Fig. 8 computes one non-overlapping
subset of $\mathrm{OCC}_t(\alpha)$ which we denote by $\mathrm{occ}_t(\alpha)$. Lemma 1 ascertains that this subset is maximal
with respect to cardinality. Using the function next-in-postorder listed in Fig. 7 we
traverse the tree $t$ in postorder. We begin by passing the parameters $t$ and $\varepsilon$ and obtain the first node
$u \in \mathrm{dom}_t$ of $t$ in postorder. The second node in postorder is obtained by passing the parameters
$t$ and $u$. This step can be repeated to traverse the whole tree $t$ in postorder. For every node $v$
which is encountered during the postorder traversal it is checked if $v$ is an occurrence of $\alpha$ and
if it is non-overlapping with all occurrences already contained in the current set $\mathrm{occ}_t(\alpha)$. If both
conditions are fulfilled, the node $v$ is added to $\mathrm{occ}_t(\alpha)$.
FUNCTION next-in-postorder(t, v)   // let t = (dom_t, λ_t)
  if (v = ε) then
    v := walk-down(t, v);
  else
    i := index(v) + 1;
    v := parent(v);

    if (rank(λ_t(v)) ≥ i) then
      v := vi;
      v := walk-down(t, v);
    endif
  endif
  return v;
ENDFUNC

FUNCTION walk-down(t, v)   // let t = (dom_t, λ_t)
  while (true) do
    if (rank(λ_t(v)) > 0) then
      v := v1;
    else
      return v;
    endif
  endwhile
ENDFUNC

Fig. 7: The algorithm which is used to traverse a tree in postorder.

FUNCTION retrieve-occurrences(t, α)   // let α = (a, i, b)
  occ_t(α) := ∅; v := ε;
  while (true) do
    v := next-in-postorder(t, v);
    if (v ∈ OCC_t(α) ∧ vi ∉ occ_t(α)) then
      occ_t(α) := occ_t(α) ∪ {v};
    endif
    if (v = ε) then
      return occ_t(α);
    endif
  endwhile
ENDFUNC

Fig. 8: The function retrieve-occurrences which is used to construct the set $\mathrm{occ}_t(\alpha)$ for a digram $\alpha \in \Pi$ and a tree
$t \in T(\mathcal{F} \cup \mathcal{N})$.
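
The greedy postorder selection of Fig. 8 can be mirrored by a short runnable sketch. It replaces the iterative next-in-postorder function by a recursive generator, which is a simplification and not the traversal used in the figures; the nested-tuple term encoding and tuple node addresses (with `()` for $\varepsilon$) are assumptions for this illustration.

```python
def postorder(term, node=()):
    """Yield (node address, subterm) for every node of the term in postorder."""
    for k, child in enumerate(term[1:], start=1):
        yield from postorder(child, node + (k,))
    yield node, term

def retrieve_occurrences(term, alpha):
    """Greedily collect non-overlapping occurrences of alpha = (a, i, b)."""
    a, i, b = alpha
    occ = set()
    for node, sub in postorder(term):
        # sub[i] is the i-th child, because sub[0] holds the label
        is_occurrence = sub[0] == a and len(sub) > i and sub[i][0] == b
        # children come first in postorder, so only the i-th child can
        # already have been selected as an overlapping occurrence
        if is_occurrence and node + (i,) not in occ:
            occ.add(node)
    return occ

leaf = ("a",)
fig_6 = ("f", leaf, ("f", leaf, ("f", leaf, ("f", leaf, leaf))))
```

With all subtrees $t_1, \ldots, t_5$ of Fig. 6 instantiated by the leaf $a$, `retrieve_occurrences(fig_6, ("f", 2, "f"))` returns the set $\{\varepsilon, 22\}$ from Example 5. For the tree of Example 6 and the digram $(f, 1, f)$ it returns $\{1\}$.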



Now, let us assume that we have constructed the set $\mathrm{occ}_t(\alpha) \subseteq \mathrm{OCC}_t(\alpha)$ using the function
retrieve-occurrences. If $a \neq b$ in $\alpha = (a, i, b)$ we have $\mathrm{occ}_t(\alpha) = \mathrm{OCC}_t(\alpha)$. In the
following, we show that the subset $\mathrm{occ}_t(\alpha) \subseteq \mathrm{OCC}_t(\alpha)$ is maximal with respect to cardinality.

Lemma 1. Let $\alpha \in \Pi$ and $t \in T(\mathcal{F} \cup \mathcal{N}, \mathcal{Y})$. Let $\sigma \subseteq \mathrm{OCC}_t(\alpha)$ be non-overlapping and
maximal with respect to cardinality. Then the equation $|\mathrm{occ}_t(\alpha)| = |\sigma|$ holds.

Proof. In the following we briefly write "maximal" for "maximal with respect to cardinality".

Let $\alpha = (a, i, b) \in \Pi$, $t \in T(\mathcal{F} \cup \mathcal{N}, \mathcal{Y})$ and $\sigma$ as above. The graph $(V, E)$ with

$V = \mathrm{OCC}_t(\alpha) \cup \{vi \mid v \in \mathrm{OCC}_t(\alpha)\}$ and $E = \{(v, vi) \mid v \in \mathrm{OCC}_t(\alpha)\}$

is a disjoint union of paths. Maximal non-overlapping subsets of $\mathrm{OCC}_t(\alpha)$ exactly correspond
to maximum matchings in $(V, E)$. Clearly, a path with an odd number of edges has a unique
maximum matching, whereas a path with an even number of edges has two maximum matchings:
one containing the first edge (in direction from the root) and one containing the last edge on the
path. Intuitively, the algorithm from Fig. 8 finds the maximum matching in $(V, E)$ which contains
for every path with an even number of edges the last edge in direction from the root. □

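Let $t \in T(\mathcal{F} \cup \mathcal{N}, \mathcal{Y})$, $\alpha = (a, i, b) \in \Pi$ and $A \in \mathcal{N}_{\mathrm{par}(\alpha)}$. By $t[\alpha/A]$ we denote the tree which
is obtained by replacing all occurrences from $\mathrm{occ}_t(\alpha)$ in the tree $t$ by the nonterminal $A$ (in
parallel). More precisely, we replace every subtree

$\mathrm{pat}(\alpha)[y_1/t_1, y_2/t_2, \ldots, y_{\mathrm{par}(\alpha)}/t_{\mathrm{par}(\alpha)}]$,

where $t_1, t_2, \ldots, t_{\mathrm{par}(\alpha)} \in T(\mathcal{F} \cup \mathcal{N}, \mathcal{Y})$, which is rooted at an occurrence $v \in \mathrm{occ}_t(\alpha)$ by a new
subtree $A(t_1, t_2, \ldots, t_{\mathrm{par}(\alpha)})$.

Example 7. Consider the tree $t \in T(\mathcal{F})$ which is depicted in Fig. 9.1. We have $\mathrm{occ}_t(\alpha) =
\{\varepsilon, 11, 12, 21, 22\}$, where $\alpha = (f, 2, f)$. By replacing the digram $\alpha$ in $t$ by a nonterminal $A \in \mathcal{N}_3$
we obtain the tree $t[\alpha/A]$ which is depicted in Fig. 9.2.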

For $t \in T(\mathcal{F} \cup \mathcal{N})$ and $m \in \mathbb{N} \cup \{\infty\}$, we define

$\mathrm{max}_m(t) = \begin{cases} \alpha \in \Pi_m & \text{if } \mathrm{occ}_t(\alpha) \neq \emptyset \text{ and } \forall \beta \in \Pi_m : |\mathrm{occ}_t(\beta)| \le |\mathrm{occ}_t(\alpha)| \\ \text{undefined} & \text{if } \forall \alpha \in \Pi_m : \mathrm{occ}_t(\alpha) = \emptyset \end{cases}$

The function $\mathrm{max}_m : T(\mathcal{F} \cup \mathcal{N}) \to \Pi$ associates with every tree $t \in T(\mathcal{F} \cup \mathcal{N})$ a digram $\alpha \in \Pi_m$
which occurs in $t$ most frequently (with respect to all digrams from $\Pi_m$). If there are multiple
most frequent digrams, we can choose any of them. In contrast, we have $\mathrm{max}_m(t) = \text{undefined}$ if
there is no most frequent digram. If $m = \infty$ there is no most frequent digram if and only if the tree
$t$ consists of exactly one node. Now let us assume that $m \neq \infty$. We have $\mathrm{max}_m(t) = \text{undefined}$
if and only if $t$ consists of exactly one node or if for all digrams $\alpha$ occurring in $t$ it holds that
$\alpha \notin \Pi_m$.

In the sequel, if we do not specify the maximal rank allowed for a nonterminal, we always
assume that $m = \infty$. For convenience we write $\mathrm{max}(t)$ instead of $\mathrm{max}_\infty(t)$, i.e., we omit the
symbol $\infty$.
3.2 Replacement of Digrams

In this section we introduce the first step of our Re-pair for Trees algorithm, namely, the
replacement step. Let $m \in \mathbb{N} \cup \{\infty\}$ be the maximal rank allowed for a nonterminal (see footnote) and let the tree
$t = (\mathrm{dom}_t, \lambda_t) \in T(\mathcal{F})$ be the input of our algorithm.

We describe a run of the Re-pair for Trees algorithm by a sequence of $h+1$ linear SLCF tree
grammars $\mathcal{G}_0, \mathcal{G}_1, \ldots, \mathcal{G}_h$, where $h \in \mathbb{N}$. For every $i \in \{0, 1, \ldots, h\}$ we have $\mathcal{G}_i = (N_i, P_i, S_i)$,
$(S_i \to t_i) \in P_i$, $\alpha_i = \mathrm{max}_m(t_i)$ and $\mathrm{val}(\mathcal{G}_i) = t$. The grammar $\mathcal{G}_0$ contains solely the start
production $(S_0 \to t_0)$, where $t_0 = t$. We obtain the grammar $\mathcal{G}_{i+1}$ by replacing the digram $\alpha_i$
in the right-hand side $t_i$ of $\mathcal{G}_i$'s start production by a new nonterminal $A_{i+1} \in \mathcal{N}_{\mathrm{par}(\alpha_i)} \setminus N_i$
($0 \le i \le h - 1$). We set

$N_{i+1} = (N_i \setminus \{S_i\}) \cup \{S_{i+1}, A_{i+1}\}$ and

$P_{i+1} = (P_i \setminus \{(S_i \to t_i)\}) \cup \{(A_{i+1} \to \mathrm{pat}(\alpha_i)), (S_{i+1} \to t_{i+1})\}$,

where $t_{i+1} = t_i[\alpha_i/A_{i+1}]$.

The computation stops if there is no digram $\alpha \in \Pi_m$ occurring at least twice in the start
production of the current grammar, i.e., either the equation $|\mathrm{occ}_{t_h}(\mathrm{max}_m(t_h))| = 1$ or the equation
$\mathrm{max}_m(t_h) = \text{undefined}$ holds. In contrast, for all $0 \le i \le h - 1$ we have $|\mathrm{occ}_{t_i}(\mathrm{max}_m(t_i))| > 1$.
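
The replacement step can be prototyped in a few lines. The following is a naive sketch under stated assumptions and not the linear-time implementation discussed in Sect. 4: it recounts all digrams in every round, uses recursion, encodes terms as nested tuples `(label, child_1, ..., child_n)`, names new nonterminals `A1`, `A2`, ... (assumed not to clash with terminal symbols), and stores for each nonterminal its digram together with the rank of the child symbol instead of the explicit pattern $\mathrm{pat}(\alpha)$.

```python
from collections import Counter

def count_digrams(term, m, counts, ranks, chosen, node=()):
    """Count greedily chosen (postorder) occurrences of all digrams with par <= m."""
    label, *kids = term
    ranks[label] = len(kids)
    for k, kid in enumerate(kids, start=1):
        count_digrams(kid, m, counts, ranks, chosen, node + (k,))
    for k, kid in enumerate(kids, start=1):
        alpha = (label, k, kid[0])
        par = len(kids) + (len(kid) - 1) - 1      # rank(a) + rank(b) - 1
        # an occurrence overlaps only with one chosen at its k-th child
        if par <= m and (alpha, node + (k,)) not in chosen:
            chosen.add((alpha, node))
            counts[alpha] += 1

def rewrite(term, alpha, name):
    """Replace the greedily chosen occurrences of alpha by the nonterminal name."""
    a, i, b = alpha
    label, *kids = term
    kids = [rewrite(k, alpha, name) for k in kids]
    if label == a and len(kids) >= i and kids[i - 1][0] == b:
        return (name, *kids[:i - 1], *kids[i - 1][1:], *kids[i:])
    return (label, *kids)

def replacement_step(tree, m=float("inf")):
    """Return the final start right-hand side and the introduced productions."""
    productions = {}   # nonterminal -> (digram, rank of the child symbol)
    start = tree
    while True:
        counts, ranks = Counter(), {}
        count_digrams(start, m, counts, ranks, set())
        if not counts or counts.most_common(1)[0][1] < 2:
            return start, productions
        alpha = counts.most_common(1)[0][0]
        name = f"A{len(productions) + 1}"
        productions[name] = (alpha, ranks[alpha[2]])
        start = rewrite(start, alpha, name)

def expand(term, productions):
    """Undo all replacements, i.e. compute the tree represented by the grammar."""
    label, *kids = term
    kids = [expand(k, productions) for k in kids]
    if label in productions:
        (a, i, b), rank_b = productions[label]
        inner = (b, *kids[i - 1:i - 1 + rank_b])
        return expand((a, *kids[:i - 1], inner, *kids[i - 1 + rank_b:]), productions)
    return (label, *kids)
```

For the tree $f(a, f(a, f(a, f(a, a))))$ the sketch first replaces the digram $(f, 1, a)$, which occurs four times, and then the digram $(A_1, 1, A_1)$, ending with the start right-hand side $A_2(A_2(a))$. Calling `expand` on the result restores the input tree.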

Note that the linear SLCF tree grammar $\mathcal{G}_h$ is almost in Chomsky normal form (CNF) as it is
defined in [LMSS09]. By appropriately transforming the right-hand side of $S_h$ (as it is described
in the proof of Proposition 5 of [LMSS09]) and introducing a production with right-hand side
$a(y_1, \ldots, y_n)$ for every terminal $a \in \mathcal{F}_n$ ($n \in \mathbb{N}$) we would obtain a linear SLCF tree grammar
which perfectly meets the requirements of the CNF.

The linear SLCF tree grammar $\mathcal{G}_h$ can only be considered an intermediate result, since it
potentially consists of productions which do not contribute to a compact representation of the
input tree $t$. Therefore, we get rid of unprofitable productions by eliminating them during the
so-called pruning step. The latter, which is described in the next section, is executed directly after
the replacement step.

3.3 Pruning the Grammar 

Let $\mathcal{G} = (N, P, S)$ be a linear SLCF tree grammar. We eliminate a production $(A \to t)$ from $P$
as follows:

(1) For every reference $(t', v) \in \mathrm{ref}_{\mathcal{G}}(A)$ we replace the subtree $A(t_1, t_2, \ldots, t_n)$ rooted at
$v \in \mathrm{dom}_{t'}$ by the tree

$t[y_1/t_1, y_2/t_2, \ldots, y_n/t_n]$,

where $t_1, \ldots, t_n \in T(\mathcal{F} \cup \mathcal{N}, \mathcal{Y})$ and $n = \mathrm{rank}(A)$.

(2) We update the set of productions by setting

$P := P \setminus \{(A \to t)\}$.

Footnote: Regarding our implementation of the Re-pair for Trees algorithm which is described in Sect. 4, $m$ is a parameter which can
be specified by the user.
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Let Q = (N, P, S) be the linear SLCF tree grammar generated in the replacement step of our 
algorithm, /. e., we have G = Gh- Let n = | A^| and let 

OJ = Bi, B2, . . . , Bn-l,Bn 

be a sequence of all nonterminals of N in hierarchical order, i. e., the following conditions hold: 

(i) Bn = S 
(ii) yi < i < j < n : Bj ^*g Bi 

Let $(B_i \to t_i), (B_j \to t_j) \in P$, where $1 \le i, j < n$ and $i \ne j$. If we eliminate $B_i$ this may have an impact on the value of $\mathrm{sav}_{\mathcal{G}}(B_j)$ from (1). We need to differentiate between two cases:

(1) If $B_i$ occurs in $t_j$, i.e., $B_i \rightsquigarrow_{\mathcal{G}} B_j$, then $|t_j|$ is increased because of the elimination of $B_i$. At the same time, $\mathrm{sav}_{\mathcal{G}}(B_j)$ goes up if we have $|\mathrm{ref}_{\mathcal{G}}(B_j)| > 1$. The increase of $|t_j|$ is due to the fact that we can assume that the inequality $|\{v \in \mathrm{dom}_{t_i} \mid \lambda_{t_i}(v) \notin \mathcal{Y}\}| \ge 2$ holds. Every production which was introduced in the replacement step represents a digram and therefore consists of at least two nodes labeled by the parent and child symbol, respectively, of this digram.

(2) If $B_j$ occurs in $t_i$, i.e., $B_j \rightsquigarrow_{\mathcal{G}} B_i$, then $|\mathrm{ref}_{\mathcal{G}}(B_j)|$ and therefore $\mathrm{sav}_{\mathcal{G}}(B_j)$ are possibly increased by eliminating $B_i$. In fact, both values go up if $|\mathrm{ref}_{\mathcal{G}}(B_i)| > 1$.

First phase. In the first phase of the pruning step, we eliminate every production $(A \to t) \in P$ with $|\mathrm{ref}_{\mathcal{G}}(A)| = 1$. That way we achieve not only a possible reduction of the size of $\mathcal{G}$ (because we have $\mathrm{sav}_{\mathcal{G}}(A) = -\mathrm{rank}(A)$ for every $A \in N$ referenced only once) but we also decrement the number of nonterminals $|N|$ each time we eliminate such a production.

Second phase. In the second phase of the pruning step we eliminate all remaining inefficient productions. We consider a production $(A \to t) \in P$ as inefficient if $\mathrm{sav}_{\mathcal{G}}(A) \le 0$. Unfortunately, this time we have to deal with a rather complex optimization problem. In contrast to the first phase, the decision whether to eliminate a production $(A \to t) \in P$ now depends on the value $\mathrm{sav}_{\mathcal{G}}(A)$. However, the latter may be increased by eliminating other nonterminals (see the above case distinction). This forces us to use a heuristic to decide which productions to remove next from the grammar. After completing the first phase, we cycle through the remaining productions in their reverse hierarchical order. For every $(A \to t) \in P$ we check if $\mathrm{sav}_{\mathcal{G}}(A) \le 0$. If this proves to be true, we eliminate $(A \to t)$. That way $|\mathcal{G}|$ and $|N|$ are possibly further reduced.

The following example shows that the size of the final grammar generated by the Re-pair for Trees algorithm may depend on the order in which inefficient productions are eliminated.

Example 8. Consider the linear SLCF tree grammar $\mathcal{G} = (N, P, S)$, where $N = \{S, A, B\}$ and $P$ is the following set of productions:

\[ \begin{aligned} S &\to f(A(a, a), B(A(a, a))) \\ A(y_1, y_2) &\to f(B(y_1), y_2) \\ B(y_1) &\to f(y_1, a) \end{aligned} \]

Let us assume that the grammar $\mathcal{G}$ was generated by the replacement step of our algorithm and that we now want to remove all inefficient productions. We have $\mathrm{sav}_{\mathcal{G}}(A) = -1$ and $\mathrm{sav}_{\mathcal{G}}(B) = 0$, i.e., the productions with left-hand sides $A$ and $B$ do not contribute to a small representation of the input tree $\mathrm{val}(\mathcal{G})$. Let us consider the following two cases:

(1) If we eliminate the production with left-hand side $A$, we obtain the grammar $\mathcal{G}_1 = (N_1, P_1, S_1)$, where $N_1 = \{S_1, B_1\}$ and $P_1$ is the following set of productions:

\[ \begin{aligned} S_1 &\to f(f(B_1(a), a), B_1(f(B_1(a), a))) \\ B_1(y_1) &\to f(y_1, a) \end{aligned} \]

We have $|\mathcal{G}_1| = 11$ and $\mathrm{sav}_{\mathcal{G}_1}(B_1) = 1$, i.e., the production with left-hand side $B_1$ is not considered inefficient.

(2) In contrast, the elimination of the production with left-hand side $B$ yields the linear SLCF tree grammar $\mathcal{G}_2 = (N_2, P_2, S_2)$, where $N_2 = \{S_2, A_2\}$ and $P_2$ is the following set of productions:

\[ \begin{aligned} S_2 &\to f(A_2(a, a), f(A_2(a, a), a)) \\ A_2(y_1, y_2) &\to f(f(y_1, a), y_2) \end{aligned} \]

We also eliminate the production with left-hand side $A_2$ since we have $\mathrm{sav}_{\mathcal{G}_2}(A_2) = 0$. This leads to an updated grammar $\mathcal{G}_2 = (N_2, P_2, S_2)$, where $N_2 = \{S_2\}$ and $P_2$ contains solely the production

\[ S_2 \to f(f(f(a, a), a), f(f(f(a, a), a), a)). \]

We have $|\mathcal{G}_2| = 12$.

This case distinction shows that the order in which inefficient productions are eliminated has an influence on the size of the final grammar (since $|\mathcal{G}_1| < |\mathcal{G}_2|$). Let us consider the sequence $A, B, S$, which is the only way to enumerate the nonterminals from $N$ in hierarchical order. Due to the fact that the heuristic described above cycles through the productions in their reverse hierarchical order to eliminate inefficient productions, we would obtain the larger grammar $\mathcal{G}_2$ if we executed the pruning step with $\mathcal{G}$ as the input grammar.
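The numbers in Example 8 can be re-checked mechanically. The following is a minimal sketch, not the TreeRePair implementation: it assumes trees are nested tuples, that parameters are leaves named `y1`, `y2`, ... and that the size of a grammar is the total number of edges of its right-hand sides. All function names are hypothetical.

```python
# Trees are nested tuples: (label, child_1, ..., child_k).
# Parameters are leaves labelled "y1", "y2", ... (a convention of this sketch).

def is_param(label):
    return label.startswith("y")

def edges(t):
    """Size |t|: the number of edges of t."""
    return sum(1 + edges(c) for c in t[1:])

def rank(rhs):
    """Number of parameters occurring in a right-hand side."""
    return int(is_param(rhs[0])) + sum(rank(c) for c in rhs[1:])

def substitute(rhs, args):
    """Compute rhs[y_1/args[0], ..., y_n/args[n-1]]."""
    if is_param(rhs[0]):
        return args[int(rhs[0][1:]) - 1]
    return (rhs[0],) + tuple(substitute(c, args) for c in rhs[1:])

def inline(t, name, rhs):
    """Replace every subtree name(t_1, ..., t_n) of t by rhs[y_i/t_i]."""
    kids = tuple(inline(c, name, rhs) for c in t[1:])
    return substitute(rhs, kids) if t[0] == name else (t[0],) + kids

def refs(grammar, name):
    """|ref(name)|: occurrences of name in all right-hand sides."""
    def count(t):
        return int(t[0] == name) + sum(count(c) for c in t[1:])
    return sum(count(rhs) for rhs in grammar.values())

def sav(grammar, name):
    """sav(A) = |ref(A)| * (|t| - rank(A)) - |t| for the production A -> t."""
    rhs = grammar[name]
    return refs(grammar, name) * (edges(rhs) - rank(rhs)) - edges(rhs)

def eliminate(grammar, name):
    """Remove the production for name and inline it everywhere else."""
    rhs = grammar[name]
    return {n: inline(t, name, rhs) for n, t in grammar.items() if n != name}

def size(grammar):
    return sum(edges(rhs) for rhs in grammar.values())

a = ("a",)
G = {
    "S": ("f", ("A", a, a), ("B", ("A", a, a))),
    "A": ("f", ("B", ("y1",)), ("y2",)),
    "B": ("f", ("y1",), a),
}

g1 = eliminate(G, "A")                  # case (1): eliminate A only
g2 = eliminate(eliminate(G, "B"), "A")  # case (2): eliminate B, then A
print(sav(G, "A"), sav(G, "B"), size(g1), size(g2))  # prints: -1 0 11 12
```

The sketch keeps the original nonterminal names instead of renaming them to $B_1$ or $A_2$; apart from that, `g1` and `g2` correspond to the two cases of the example.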

Given the above example one might expect better compression results if the inefficient productions are eliminated in the order of their sav-values, i.e., if we proceeded as follows: as long as there is a production whose left-hand side has a sav-value smaller than or equal to 0, we remove a production whose left-hand side has the smallest occurring sav-value. However, our investigations showed that this approach leads to unappealing final grammars, at least for our set of test input trees. The grammars generated by this approach exhibit nearly the same number of edges but many more nonterminals (about 50% more) compared to the grammars obtained using the above heuristic.

Table 2.1: All digrams encountered in the input tree $t_0$ and their number of non-overlapping occurrences.

    digram $\alpha$                                   $|\mathrm{occ}_{t_0}(\alpha)|$
    $(\text{title}^{01}, 1, \text{isbn}^{00})$        5
    $(\text{author}^{01}, 1, \text{title}^{01})$      5
    $(\text{book}^{11}, 1, \text{author}^{01})$       4
    $(\text{book}^{11}, 2, \text{book}^{11})$         2
    $(\text{book}^{11}, 2, \text{book}^{10})$         1
    $(\text{book}^{10}, 1, \text{author}^{01})$       1
    $(\text{books}^{10}, 1, \text{book}^{11})$        1

Table 2.2: All digrams encountered in the tree $t_1$ and their number of non-overlapping occurrences.

    digram $\alpha$                                   $|\mathrm{occ}_{t_1}(\alpha)|$
    $(\text{author}^{01}, 1, A_1)$                    5
    $(\text{book}^{11}, 1, \text{author}^{01})$       4
    $(\text{book}^{11}, 2, \text{book}^{11})$         2
    $(\text{book}^{11}, 2, \text{book}^{10})$         1
    $(\text{book}^{10}, 1, \text{author}^{01})$       1
    $(\text{books}^{10}, 1, \text{book}^{11})$        1

Table 2.3: All digrams encountered in the tree $t_2$ and their number of non-overlapping occurrences.

    digram $\alpha$                                   $|\mathrm{occ}_{t_2}(\alpha)|$
    $(\text{book}^{11}, 1, A_2)$                      4
    $(\text{book}^{11}, 2, \text{book}^{11})$         2
    $(\text{book}^{11}, 2, \text{book}^{10})$         1
    $(\text{book}^{10}, 1, A_2)$                      1
    $(\text{books}^{10}, 1, \text{book}^{11})$        1

Note that it is not possible to detect digrams leading to inefficient productions already during the replacement step. For instance, it would be unwise to ignore a priori all digrams that occur only twice and exhibit a large number of parameters.

Example 9. Imagine an input tree $t \in T(\mathcal{F})$ comprising two instances of a large tree pattern $t' \in T(\mathcal{F}, \mathcal{Y})$. Let $\lambda_{t'}(v) \ne \lambda_{t'}(u)$ for all $v, u \in \mathrm{dom}_{t'}$, $u \ne v$. Furthermore, let us assume that none of the symbols in the tree pattern $t'$ occurs outside of this pattern. For every digram $\alpha$ occurring in the tree pattern $t'$ (whose replacement may at first lead to a production with a large number of parameters) we would have $|\mathrm{occ}_t(\alpha)| = 2$. It becomes clear that this great redundancy in the input tree $t$, which can be represented by a production with right-hand side $t'$, would not be detected if we did not carry out these digram replacements, which initially seem anything but efficient.

3.4 Complete Example 

Let the tree depicted in Fig. 5 be our input tree $t_0$ and let there be no restrictions on the maximal rank allowed for a nonterminal. We set $\mathcal{G}_0 = (N_0, P_0, S_0)$, where $N_0 = \{S_0\}$ and $P_0$ solely contains the production $(S_0 \to t_0)$. Table 2.1 shows every digram $\alpha$ encountered in $t_0$ along with its number of non-overlapping occurrences $|\mathrm{occ}_{t_0}(\alpha)|$. Furthermore, this table tells us that the two digrams $(\text{title}^{01}, 1, \text{isbn}^{00})$ and $(\text{author}^{01}, 1, \text{title}^{01})$ are the most frequent digrams occurring in $t_0$. We decide to replace the former digram and therefore have $\max(t_0) = (\text{title}^{01}, 1, \text{isbn}^{00}) =: \alpha_0$.

Now, in the first iteration of our computation, we generate a new linear SLCF tree grammar $\mathcal{G}_1 = (N_1, P_1, S_1)$ as follows. We introduce a new nonterminal $A_1 \in \mathcal{N}_0$ and set $N_1 = \{S_1, A_1\}$. After that, we introduce the new production $(A_1 \to \mathrm{pat}(\alpha_0))$, where $\mathrm{pat}(\alpha_0) = \text{title}^{01}(\text{isbn}^{00})$. Finally, we set $P_1 = \{(S_1 \to t_1), (A_1 \to \mathrm{pat}(\alpha_0))\}$, where we have $t_1 = t_0[\alpha_0/A_1]$. The tree $t_1$ is depicted in Fig. 10.

Fig. 10: Tree $t_1$ which evolved from the input tree $t_0$ in the first iteration of our computation.

Fig. 11: Tree $t_2$ which evolved from the tree $t_1$ in the second iteration of our computation.

In the second iteration, during which we generate the grammar $\mathcal{G}_2 = (N_2, P_2, S_2)$, we have $\max(t_1) = (\text{author}^{01}, 1, A_1) =: \alpha_1$, as can be seen in Table 2.2. Again, we introduce a new nonterminal $A_2 \in \mathcal{N}_0$ with right-hand side $\mathrm{pat}(\alpha_1)$, set $N_2 = \{S_2, A_1, A_2\}$ and set $P_2 = \{(S_2 \to t_2), (A_1 \to \mathrm{pat}(\alpha_0)), (A_2 \to \mathrm{pat}(\alpha_1))\}$, where $t_2 = t_1[\alpha_1/A_2]$ (see Fig. 11). We have $\max(t_2) = (\text{book}^{11}, 1, A_2) =: \alpha_2$ (cf. Table 2.3) in the third iteration of our algorithm. This time, we need to introduce a new nonterminal $A_3 \in \mathcal{N}_1$, i.e., a nonterminal with one parameter, with right-hand side $\mathrm{pat}(\alpha_2) = \text{book}^{11}(A_2, y_1)$. We obtain the grammar $\mathcal{G}_3 = (N_3, P_3, S_3)$, where

\[ \begin{aligned} N_3 &= \{S_3, A_1, A_2, A_3\}, \\ P_3 &= (P_2 \setminus \{(S_2 \to t_2)\}) \cup \{(S_3 \to t_3), (A_3 \to \mathrm{pat}(\alpha_2))\} \quad \text{and} \\ t_3 &= \text{books}^{10}(A_3(A_3(A_3(A_3(\text{book}^{10}(A_2)))))) \end{aligned} \]

by replacing the 4 occurrences of $\alpha_2$.

In the fourth and last iteration the digram $(A_3, 1, A_3)$ is replaced by a new nonterminal $A_4 \in \mathcal{N}_1$. Therefore, we obtain the grammar $\mathcal{G}_4 = (N_4, P_4, S_4)$ with 10 edges and 5 nonterminals, where we have $N_4 = \{S_4, A_1, A_2, A_3, A_4\}$ and $P_4$ is the following set of productions:

\[ \begin{aligned} S_4 &\to \text{books}^{10}(A_4(A_4(\text{book}^{10}(A_2)))) \\ A_4(y_1) &\to A_3(A_3(y_1)) \\ A_3(y_1) &\to \text{book}^{11}(A_2, y_1) \\ A_2 &\to \text{author}^{01}(A_1) \\ A_1 &\to \text{title}^{01}(\text{isbn}^{00}) \end{aligned} \]

Finally, in the pruning step, we begin by merging the right-hand side of $A_1$ with the right-hand side of $A_2$ since $|\mathrm{ref}_{\mathcal{G}_4}(A_1)| = 1$, i.e., it is only referenced once. This yields the updated production $(A_2 \to \text{author}^{01}(\text{title}^{01}(\text{isbn}^{00})))$. Furthermore, we roll back the replacement of the digram $(A_3, 1, A_3)$ because it does not contribute to the reduction of the total number of edges. Although the production with left-hand side $A_4$ is referenced twice in the right-hand sides of $\mathcal{G}_4$ and removes redundancy, this gain is neutralized by the necessary edge to the parameter node. This is indicated by the $\mathrm{sav}_{\mathcal{G}_4}$ value of $A_4$, see (1):

\[ \begin{aligned} \mathrm{sav}_{\mathcal{G}_4}(A_4) &= |\mathrm{ref}_{\mathcal{G}_4}(A_4)| \cdot (|A_3(A_3(y_1))| - \mathrm{rank}(A_4)) - |A_3(A_3(y_1))| \\ &= 2 \cdot (2 - 1) - 2 = 0 \end{aligned} \]

With these adjustments we obtain the linear SLCF tree grammar $\mathcal{G} = (N, P, S_4)$, where $N = \{S_4, A_2, A_3\}$ and $P$ is the following set of productions:

\[ \begin{aligned} S_4 &\to \text{books}^{10}(A_3(A_3(A_3(A_3(\text{book}^{10}(A_2)))))) \\ A_3(y_1) &\to \text{book}^{11}(A_2, y_1) \\ A_2 &\to \text{author}^{01}(\text{title}^{01}(\text{isbn}^{00})) \end{aligned} \]

Compared to the grammar $\mathcal{G}_4$ it has the same number of edges (namely 10) but only about half as many nonterminals.

3.5 Another Example 

It is very unlikely to be confronted with an XML document tree which, in the binary tree model, 
is represented by a perfect binary tree^. Nevertheless we want to investigate the compression per- 
formance of our algorithm on this kind of trees since it is an interesting aspect from a theoretical 
point of view. Last but not least our undertaking is justified by the fact that the actual Re-pair for 
Trees algorithm is not restricted to applications processing XML files but can be used in other 
applications as well. The latter, in turn, may exhibit ranked trees similar to full binary trees. 

Let $t \in T(\mathcal{F})$ be a sufficiently large perfect binary tree of which each inner node is labeled by a terminal $f \in \mathcal{F}_2$ and each leaf is labeled by a terminal $a \in \mathcal{F}_0$. A run of Re-pair for Trees on $t$ consists of $2 \cdot (d - 1)$ iterations folding the input tree beginning at its leaves, where $d = \mathrm{depth}(t)$. Thus, in the first two iterations, the digrams formed by the leaf nodes and their parents are replaced. We obtain the productions $A_1(y_1) \to f(y_1, a)$ and $A_2 \to A_1(a)$, each occurring $2^{d-1}$ times. Now, we undertake further digram replacements in a bottom-up fashion. In the $(2i-1)$-th and $2i$-th iteration we replace two digrams resulting in the productions $A_{2i-1}(y_1) \to f(y_1, A_{2(i-1)})$ and $A_{2i} \to A_{2i-1}(A_{2(i-1)})$, respectively, where $2 \le i \le d - 1$.

The production with left-hand side $A_{2k-1}$ occurs only once for every $1 \le k \le d - 1$. Therefore, in the pruning step, for every $1 \le k \le d - 1$ the production with left-hand side $A_{2k-1}$ is eliminated by merging its right-hand side with the right-hand side of the production with left-hand side $A_{2k}$. In particular, the production with left-hand side $A_1$ is merged with the production for $A_2$, resulting in a production $A_2 \to f(a, a)$.

Finally, we obtain a linear SLCF tree grammar with $d$ nonterminals, including the left-hand side of the start production $S \to f(A_{2(d-1)}, A_{2(d-1)})$, and a total of $2 \cdot d$ edges. Note that even though some of the intermediate productions exhibit parameters, the final grammar consists only of nonterminals of rank 0. Thus, the generated grammar is a DAG and in this particular case the minimal DAG of the input tree.
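The claim that the final grammar coincides with the minimal DAG, with $d$ nonterminals and $2 \cdot d$ edges, can be checked independently of Re-pair for Trees. The sketch below computes the minimal DAG by hash-consing identical subtrees; it is an illustration under the assumption that trees are nested tuples, and the helper names are hypothetical.

```python
def perfect_tree(depth):
    """Perfect binary tree: inner nodes f, leaves a, every leaf at the given depth."""
    t = ("a",)
    for _ in range(depth):
        t = ("f", t, t)
    return t

def edges(t):
    """Number of edges of the (unshared) tree t."""
    return sum(1 + edges(c) for c in t[1:])

def minimal_dag(t, table):
    """Hash-consing: map every distinct subtree of t to one id stored in table."""
    key = (t[0],) + tuple(minimal_dag(c, table) for c in t[1:])
    return table.setdefault(key, len(table))

def dag_stats(t):
    """Return (number of distinct inner nodes, number of edges) of the minimal DAG."""
    table = {}
    minimal_dag(t, table)
    inner = [key for key in table if len(key) > 1]
    return len(inner), sum(len(key) - 1 for key in inner)

t = perfect_tree(4)
print(edges(t), dag_stats(t))  # prints: 30 (4, 8)
```

Each distinct inner node of the DAG corresponds to one nonterminal of the pruned grammar (the root playing the role of $S$), so the output for depth 4 agrees with the 4 nonterminals and 8 edges of Example 10.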



^ A perfect binary tree is a binary tree in which every node is either of rank 2 or 0 and all leaves are at the same level (i.e., the paths to the root are of the same length). In contrast, a full binary tree has no restrictions on the level of the leaves, i.e., the only requirement is that every node is either of rank 2 or 0.

Example 10. Let $t \in T(\mathcal{F})$ be the perfect binary tree from Fig. 12.1 with 30 edges and $\mathrm{depth}(t) = 4$. A run of Re-pair for Trees initially generates the 6 productions listed in Fig. 12.2. After the pruning step we finally obtain the linear SLCF tree grammar $\mathcal{G} = (N, P, S)$, where $N = \{A_2, A_4, A_6, S\}$ and the set of productions $P$ consists of the productions from Fig. 12.3. The size of $\mathcal{G}$ is $|\mathcal{G}| = 8$.

3.6 Unlimited Maximal Rank 

It seems natural to assume that, in general, trees can be compressed best by the Re-pair for Trees algorithm if there are no restrictions on the maximal rank of a nonterminal. However, it turns out that there are (not so uncommon) types of trees for which the opposite is true. Firstly, in this section, we will construct a set of trees whose compressibility is best if there are no restrictions on the maximal rank of a nonterminal. After that, in the succeeding section, we will present a set of trees whose compressibility is best when restricting the maximal rank to 1.

Let us consider the infinite set $M = \{t_1, t_2, t_3, \dots\} \subseteq T(\mathcal{F})$ of trees, where for all $i \in \mathbb{N}_{>0}$ the tree $t_i$ has the following properties:

- The tree $t_i$ is a perfect binary tree of depth $2^i$.

- Each inner node of $t_i$ is labeled by the terminal $f \in \mathcal{F}_2$.

- Each leaf of $t_i$ is labeled by a unique terminal from $\mathcal{F}_0$, i.e., there do not exist two different leaves which are labeled by the same symbol.

Example 11. Figure 13.1 shows a simplified depiction of the tree $t_3 \in M$. The inner nodes labeled by the symbol $f \in \mathcal{F}_2$ are represented by filled circles. In contrast, the leaves, each of which is labeled by a unique symbol from $\mathcal{F}_0$, are depicted by unfilled circles.

The tree $t_3$ is compressed by a run of our algorithm as follows. The digrams $(f, 1, f)$ and $(f, 2, f)$ occur equally often in $t_3$. It makes no difference to the size of the final grammar whether we replace the former or the latter. Let us replace the digram $(f, 2, f)$ (whose occurrences are painted in green in Fig. 13.1) by a nonterminal $A_1 \in \mathcal{N}_3$ with right-hand side $f(y_1, f(y_2, y_3))$. We obtain the tree of the form shown in Fig. 13.2. After that, the digram $(A_1, 1, f)$, which occurs the same number of times as $(f, 2, f)$ did, is replaced by the nonterminal $A_2 \in \mathcal{N}_4$ with right-hand side $A_1(f(y_1, y_2), y_3, y_4)$. The occurrences of $(A_1, 1, f)$ are marked with green paint in Fig. 13.2. The right-hand side of the nonterminal $A_1$ is merged with the right-hand side of $A_2$ during the pruning step since $A_1$ is only referenced once. This yields the production with left-hand side $A_2$ and right-hand side $f(f(y_1, y_2), f(y_3, y_4))$.

After the replacement of the above two digrams the right-hand side of the start production is a 4-ary tree of depth 4 whose inner nodes are labeled by $A_2$ (see Fig. 13.3). Now, the digrams

\[ (A_2, 1, A_2), \; (A_2, 2, A_2), \; (A_2, 3, A_2), \; (A_2, 4, A_2) \]

occur equally often. Again, the order of the digram replacements makes no difference to the final grammar. Assuming that at first we replace the digram $(A_2, 4, A_2)$, which is marked with green paint in Fig. 13.3, by a new nonterminal $A_3$, we obtain the tree shown in Fig. 13.4. After that, the digrams $(A_3, 3, A_2)$, $(A_4, 2, A_2)$ and $(A_5, 1, A_2)$ are replaced in three additional iterations. The above four digram replacements result in a new production

\[ A_6(y_1, \dots, y_{16}) \to A_2(A_2(y_1, \dots, y_4), \dots, A_2(y_{13}, \dots, y_{16})) \]

after pruning the grammar. The remaining tree is a 16-ary tree of depth 2 (of the form depicted in Fig. 13.5) whose inner nodes are labeled by the nonterminal $A_6$. In this tree there is no digram occurring more than once. Therefore, the execution of our algorithm stops.

Fig. 13.1: The tree $t_3 \in M$.

Fig. 13.2: The tree $t_3 \in M$ after replacing the digram $(f, 2, f)$.

Fig. 13.3: The tree $t_3 \in M$ after the second iteration, i.e., after replacing the digrams $(f, 2, f)$ and $(A_1, 1, f)$.

Fig. 13.4: The tree which remains after replacing the digram $(A_2, 4, A_2)$ in the tree from Fig. 13.3.

Fig. 13.5: The tree $t_3 \in M$ after 6 iterations of our algorithm. We obtained a 16-ary tree whose inner nodes are labeled by the nonterminal $A_6$.

Fig. 14: Right-hand side $s_i$ of the nonterminal $B_i$, where $r = \mathrm{rank}(B_{i-1})$ and $i > 1$.

Now, we want to analyze the behavior of Re-pair for Trees on a tree from $M$ in general. Let $x \in \mathbb{N}_{>0}$ and let $\mathrm{it} : \mathbb{N}_{>0} \to \mathbb{N}_{>0}$ be the following function:

\[ \mathrm{it}(x) = \sum_{j=0}^{x-1} 2^{2^j} \]

Let $B_1, B_2, B_3, \dots$ be a sequence of nonterminals where for all $i > 0$ the following conditions are fulfilled:

- $\mathrm{rank}(B_i) = 2^{2^i}$

- $s_i \in T(\mathcal{F} \cup \mathcal{N}, \mathcal{Y})$ is the right-hand side of $B_i$

- If $i = 1$, we have $s_1 = f(f(y_1, y_2), f(y_3, y_4))$ and if $i > 1$, the tree $s_i$ is of the form shown in Fig. 14, where $r = \mathrm{rank}(B_{i-1}) = 2^{2^{i-1}}$.

Fig. 15: Right-hand side of the nonterminal $C$, where $r = \mathrm{rank}(B_{n-1})$.

Regarding the nonterminals $A_2$ and $A_6$ from Example 11, we have $B_1 = A_2$ and $B_2 = A_6$, respectively. Let $i \in \{1, 2, \dots, k\}$. The following two equations hold:

\[ \mathrm{rank}(B_i) = 2^{2^i} = 2^{2^{i-1}} \cdot 2^{2^{i-1}} = \mathrm{rank}(B_{i-1})^2 \tag{2} \]

\[ |s_i| = \mathrm{rank}(B_i) + \mathrm{rank}(B_{i-1}) \tag{3} \]

For convenience, we define $\mathrm{rank}(B_0) = \mathrm{rank}(f) = 2$.

Now assume that we have an unlimited maximal rank allowed for a nonterminal. After $\mathrm{it}(n)$ iterations on $t_{n+1} \in M$ we have obtained the nonterminals $B_1, B_2, \dots, B_n$. The right-hand side of the start nonterminal is a $\mathrm{rank}(B_n)$-ary tree of height 2 (see also Example 11, where $n = 2$). At this point, no further replacements are carried out. For each of the generated nonterminals $B_1, \dots, B_n$ we have

\[ |\mathrm{ref}_{\mathcal{G}}(B_i)| = \mathrm{rank}(B_i) + 1, \tag{4} \]

where $i \in \{1, 2, \dots, n\}$ (cf. Fig. 14). Hence, we have

\[ \begin{aligned} \mathrm{sav}_{\mathcal{G}}(B_i) &\overset{(1)}{=} |\mathrm{ref}_{\mathcal{G}}(B_i)| \cdot (|s_i| - \mathrm{rank}(B_i)) - |s_i| \\ &\overset{(3)}{=} |\mathrm{ref}_{\mathcal{G}}(B_i)| \cdot \mathrm{rank}(B_{i-1}) - \mathrm{rank}(B_i) - \mathrm{rank}(B_{i-1}) \\ &\overset{(4)}{=} \mathrm{rank}(B_i) \cdot \mathrm{rank}(B_{i-1}) - \mathrm{rank}(B_i) \\ &\overset{(2)}{=} \mathrm{rank}(B_{i-1})^3 - \mathrm{rank}(B_{i-1})^2 > 0 \end{aligned} \]

since $\mathrm{rank}(B_{i-1}) \ge \mathrm{rank}(B_0) = 2$. Therefore, none of the nonterminals $B_1, \dots, B_n$ will be eliminated in the pruning step.
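The identities (2) to (4) and the resulting sav-value can be sanity-checked numerically. The sketch below is an illustration only; it assumes $\mathrm{it}(x) = \sum_{j=0}^{x-1} 2^{2^j}$, which agrees with Example 11 (2 iterations to obtain $B_1 = A_2$, 6 iterations to obtain $B_2 = A_6$), and the function names are hypothetical.

```python
def rank_b(i):
    """rank(B_i) = 2^(2^i); for i = 0 this gives rank(B_0) = rank(f) = 2."""
    return 2 ** (2 ** i)

def size_s(i):
    """|s_i| according to (3)."""
    return rank_b(i) + rank_b(i - 1)

def refs_b(i):
    """|ref(B_i)| according to (4)."""
    return rank_b(i) + 1

def sav_b(i):
    """sav(B_i) according to (1)."""
    return refs_b(i) * (size_s(i) - rank_b(i)) - size_s(i)

def it(x):
    """Assumed iteration count: B_i costs rank(B_{i-1}) digram replacements."""
    return sum(2 ** (2 ** j) for j in range(x))

for i in range(1, 5):
    r = rank_b(i - 1)
    print(i, rank_b(i), size_s(i), sav_b(i), r ** 3 - r ** 2)
```

For every $i$ the last two printed columns coincide, which is exactly the closed form $\mathrm{rank}(B_{i-1})^3 - \mathrm{rank}(B_{i-1})^2$ derived above.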

Now assume that the maximal rank is $m \in \mathbb{N}$, i.e., we have $m < \infty$. Choose the smallest $n \in \mathbb{N}$ such that

\[ 2^{2^n} > m. \tag{5} \]

Thus, $B_n$ is the first nonterminal in the sequence $B_1, B_2, \dots$ with a rank bigger than $m$. Let us consider a run of Re-pair for Trees on a tree $t_j \in M$ with $j \ge n + 1$. Then, as above, the nonterminals $B_1, \dots, B_{n-1}$ will be obtained after $\mathrm{it}(n-1)$ iterations (if we pruned the corresponding grammar at this point). At this point, the right-hand side of the start production is a $2^{2^{n-1}}$-ary tree of height $2^j / 2^{n-1} \ge 4$, where all inner nodes are labeled by the nonterminal $B_{n-1}$. Now, we can carry out $h$ additional digram replacements leading to the nonterminals $C_1, C_2, \dots, C_h \in \mathcal{N}$, where

\[ h = \max\{l \in \mathbb{N} \mid r + l \cdot (r - 1) \le m\} \tag{6} \]

and $r = \mathrm{rank}(B_{n-1}) = 2^{2^{n-1}}$. We claim that

\[ r = \mathrm{rank}(B_{n-1}) > h \tag{7} \]

holds. To see this, let us assume that $r = \mathrm{rank}(B_{n-1}) \le h$. We have

\[ m \overset{(6)}{\ge} r + h \cdot (r - 1) \ge r + r \cdot (r - 1) = r^2 = 2^{2^n}. \]

However, this contradicts (5).

Fig. 16: Right-hand side of the current start production after replacing the digram $(A_3, 3, A_2)$.

In case $h > 0$, we can argue as follows: After the pruning step, the nonterminals $C_1, C_2, \dots, C_h$ form one nonterminal $C \in \mathcal{N}$ with $\mathrm{rank}(C) = h \cdot r + r - h = r + h \cdot (r - 1)$ (see Fig. 15). It occurs at least $2^{2^n} + 1 = r^2 + 1$ many times according to (4) (the nonterminal $C$ occurs as often as $B_n$ does after $\mathrm{it}(n)$ iterations on $t_j$ in the unlimited case). Each occurrence of $C$ reduces the size of the corresponding grammar by $h$ edges and the right-hand side of $C$ consists of $r + h \cdot r$ edges (see Fig. 15). Now, let us consider the sav-value of $C$ (assuming that $\mathcal{G}$ is the current grammar after $\mathrm{it}(n) + h$ iterations):

\[ \begin{aligned} \mathrm{sav}_{\mathcal{G}}(C) &= |\mathrm{ref}_{\mathcal{G}}(C)| \cdot h - (r + h \cdot r) \\ &\ge (r^2 + 1) \cdot h - h \cdot r - r \\ &= (r^2 - r + 1) \cdot h - r \\ &\ge r^2 - 2r + 1 \\ &= (r - 1)^2 \end{aligned} \]

Thus, we have $\mathrm{sav}_{\mathcal{G}}(C) > 0$, i.e., the nonterminal $C$ is not eliminated during the pruning step.
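The quantities $h$, $\mathrm{rank}(C)$ and the lower bound on the sav-value of $C$ are simple enough to evaluate directly. The following sketch does so under the assumption $2 \le r \le m$; the function names are hypothetical, and the values $m = 10$, $r = 4$ are those of the rank-10 restriction applied to the tree $t_3$ of Example 11.

```python
def extra_replacements(m, r):
    """h = max{ l in N | r + l*(r-1) <= m } from (6); assumes 2 <= r <= m."""
    return (m - r) // (r - 1)

def rank_c(r, h):
    """rank(C) = r + h*(r-1)."""
    return r + h * (r - 1)

def sav_c_lower_bound(r, h):
    """(r^2 + 1)*h - (r + h*r): the sav-value of C if it is referenced r^2 + 1 times."""
    return (r * r + 1) * h - (r + h * r)

h = extra_replacements(10, 4)
print(h, rank_c(4, h), sav_c_lower_bound(4, h))  # prints: 2 10 22
```

With $m = 4$ and $r = 4$ the first function returns $h = 0$, i.e., no nonterminal $C$ is created at all, which is the situation of the rank-4 runs.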

Example 12. Let us assume that the maximal rank for a nonterminal is restricted to 10 in Example 11. In this case we are able to undertake exactly one additional digram replacement in the tree from Fig. 13.4, resulting in a new nonterminal $A_4 \in \mathcal{N}_{10}$. If we replace the digram $(A_3, 3, A_2)$, we obtain the tree shown in Fig. 16. We have $n = 2$, $h = 2$ and $C = A_4$. After the pruning step, the right-hand side of $A_4$ is of the form

\[ A_2(y_1, y_2, A_2(y_3, y_4, y_5, y_6), A_2(y_7, y_8, y_9, y_{10})). \]

We can further state that the nonterminal $B_{n-1}$ is not eliminated since it occurs $h + 1$ times in the right-hand side of $C$ (see Fig. 15) and $(r - h) \cdot |\mathrm{ref}_{\mathcal{G}}(C)| \ge (r - h) \cdot (r^2 + 1)$ times in the right-hand side of the current start production (below each occurrence of $C$ there are $r - h$ occurrences of $B_{n-1}$ and $C$ occurs at least $r^2 + 1$ times). Therefore, we have

\[ \begin{aligned} |\mathrm{ref}_{\mathcal{G}}(B_{n-1})| &\ge h + 1 + (r - h) \cdot (r^2 + 1) \\ &= h + 1 + r^3 - h r^2 + r - h \\ &= r^3 - h r^2 + r + 1. \end{aligned} \]

Because of (7), the inequality $|\mathrm{ref}_{\mathcal{G}}(B_{n-1})| \ge r + 1$ holds. As shown before for the unlimited rank, in this case $B_{n-1}$ has a sav-value bigger than 0 and therefore the nonterminals $B_1, B_2, \dots, B_{n-1}$ are not eliminated.

Let $\mathcal{H}^m$ be the grammar which is obtained after $\mathrm{it}(n-1) + h$ iterations on the tree $t_j$ when restricting the maximal rank to $m$ and let $\mathcal{H}^\infty$ be the current grammar after $\mathrm{it}(n)$ iterations on $t_j$ when an unlimited rank is allowed. We can conclude that $|\mathcal{H}^m| > |\mathcal{H}^\infty|$ holds, no matter whether we have $h > 0$ or $h = 0$, because of the following two facts:

(1) Each occurrence of $B_n$ saves $\mathrm{rank}(B_{n-1})$ edges (see Fig. 14) and therefore, according to (7), more than an occurrence of $C$ does. The nonterminals $B_n$ and $C$ occur equally often. However, $C$ only exists if $h > 0$.

(2) The nonterminals $B_1, B_2, \dots, B_{n-1}$ (which exist in both grammars, $\mathcal{H}^m$ and $\mathcal{H}^\infty$) and the nonterminals $B_n$ and $C$ are not eliminated during the pruning step.

Let $\mathcal{G}^m$ ($\mathcal{G}^\infty$) be the final grammar which is generated by a run of Re-pair for Trees on the tree $t_j$ when restricting the maximal rank of a nonterminal to $m$ (not restricting the maximal rank). We have $\mathcal{G}^m = \mathcal{H}^m$ and $|\mathcal{G}^\infty| \le |\mathcal{H}^\infty|$. The latter holds because with every additional digram replacement at least one edge is absorbed and because during the pruning step only nonterminals with a sav-value smaller than or equal to 0 are eliminated. Therefore $|\mathcal{G}^m| > |\mathcal{G}^\infty|$ holds. Thus, we have shown that, in general, the trees from $M$ can be compressed best if there are no restrictions on the maximal rank allowed for a nonterminal.

Example 13. Table 3 shows a comparison of the grammars generated by different runs of our algorithm on the trees $t_2$, $t_3$ and $t_4$ from $M$. By $\mathcal{G}^4$ ($\mathcal{G}^\infty$) we denote the final grammar which is generated when restricting the maximal rank to 4 (not restricting the maximal rank).

3.7 Limiting the Maximal Rank 

In the preceding section we investigated a set of trees whose compressibility was best if we did not restrict the maximal rank of a nonterminal. Now, we want to construct a set of trees which behaves contrarily, i.e., we construct trees which can be compressed best if we limit the maximal rank of a nonterminal to 1. To make the following definition easier to understand, we refer the reader to Fig. 17.1, which shows one of the trees we define in the sequel.

    Tree $t_i$    $\mathrm{depth}(t_i)$    $|t_i|$    $|\mathcal{G}^4|$    $|\mathcal{G}^\infty|$
    $t_2$         4                        30         26                   26
    $t_3$         8                        510        346                  298
    $t_4$         16                       131070     87386                66090

Table 3: Comparison of the sizes of the final grammars.



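First of all, let us define a labeling function $l : \mathbb{N} \to \mathcal{F}_0$, where

\[ l(i) = \begin{cases} a & \text{if } i \equiv 0 \bmod 5 \\ b & \text{if } i \equiv 1 \bmod 5 \\ c & \text{if } i \equiv 2 \bmod 5 \\ d & \text{if } i \equiv 3 \bmod 5 \\ e & \text{if } i \equiv 4 \bmod 5 \end{cases} \]

and $i \in \mathbb{N}$. Now, we define for all $n \in \mathbb{N}$ the tree $s_n = (\mathrm{dom}_{s_n}, \lambda_{s_n}) \in T(\mathcal{F})$, where

\[ \mathrm{dom}_{s_n} = \{[2]^i \mid 0 \le i \le 2^n\} \cup \{[2]^i[1] \mid 0 \le i < 2^n\} \]

and

\[ \lambda_{s_n}(v) = \begin{cases} f & \text{if } v = [2]^i, \; 0 \le i < 2^n \\ l(i) & \text{if } v = [2]^i[1], \; 0 \le i < 2^n \\ l(2^n) & \text{if } v = [2]^{2^n}. \end{cases} \]

Let us define $U = \{s_n \mid n \in \mathbb{N}, n \ge 3\}$. In the following we will show that for every run of Re-pair for Trees on a tree $s \in U$ we have $|\mathcal{G}^1| < |\mathcal{G}^\infty|$, where $\mathcal{G}^1$ is the grammar generated when allowing a maximal rank of 1 for a nonterminal and $\mathcal{G}^\infty$ is the resulting grammar when there is no restriction on the maximal rank.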
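The trees $s_n$ and the digram counts used in the argument that follows can be reproduced with a short script. This is a sketch under the assumption that $s_n$ is the right-leaning comb described by the labeling above: a spine of $2^n$ nodes labeled $f$, where the $i$-th spine node has the leaf $l(i)$ as first child and the spine ends in the leaf $l(2^n)$. Non-overlapping occurrences of $(f, 2, f)$ are counted greedily from the root, which is one possible counting strategy and not necessarily the one of TreeRePair. All function names are hypothetical.

```python
from collections import Counter
from math import ceil

def l(i):
    """Labeling function: a, b, c, d, e repeating with period 5."""
    return "abcde"[i % 5]

def build_s(n):
    """Build s_n bottom-up as nested tuples (label, child_1, child_2)."""
    t = (l(2 ** n),)
    for i in reversed(range(2 ** n)):
        t = ("f", (l(i),), t)
    return t

def count_f2f(t):
    """Greedy count of non-overlapping occurrences of the digram (f, 2, f)."""
    count = 0
    while t[0] == "f":
        child = t[2]
        if child[0] == "f":
            count += 1
            t = child[2]  # the child is consumed by this occurrence
        else:
            t = child
    return count

def count_f1x(t):
    """Occurrences of the digrams (f, 1, x), keyed by the leaf symbol x."""
    counts = Counter()
    while t[0] == "f":
        counts[t[1][0]] += 1
        t = t[2]
    return counts

s4 = build_s(4)
print(count_f2f(s4), dict(count_f1x(s4)))
```

For $n = 4$ the script reports 8 occurrences of $(f, 2, f)$ and at most 4 occurrences of any digram $(f, 1, x)$, so $(f, 2, f)$ is the most frequent digram when the rank is unrestricted.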

Let us consider a run $\mathcal{G}_0^\infty, \mathcal{G}_1^\infty, \dots, \mathcal{G}_{n-1}^\infty$ of the Re-pair for Trees algorithm on the tree $s_n$ with no restrictions on the maximal rank of a nonterminal, where $\mathcal{G}_i^\infty = (N_i, P_i, S_i)$, $(S_i \to t_i)$ is the start production of $\mathcal{G}_i^\infty$ and $i \in \{0, 1, \dots, n-1\}$. In the first iteration of our computation the digram $(f, 2, f)$ is the most frequent digram, i.e., $\max(t_0) = (f, 2, f)$. This is because of $|\mathrm{occ}_{s_n}((f, 2, f))| = 2^{n-1}$ whereas for every $x \in \{a, b, c, d, e\}$ the inequality

\[ |\mathrm{occ}_{s_n}((f, 1, x))| \le \lceil 2^n / 5 \rceil \]

holds. Therefore, we replace the digram $(f, 2, f)$ by a new nonterminal $A_1$ and obtain $\mathcal{G}_1^\infty$. In every subsequent iteration $i$ we replace $\max(t_{i-1}) = (A_{i-1}, 2^{i-1} + 1, A_{i-1})$ by a new nonterminal $A_i$, where $i \in \{2, 3, \dots, n-1\}$. For every $1 \le i \le n-1$ the right-hand side of the start production of the grammar $\mathcal{G}_i^\infty$ is given by the tree $t_i = (\mathrm{dom}_{t_i}, \lambda_{t_i})$, where

\[ \mathrm{dom}_{t_i} = \bigcup_{j=0}^{2^{n-i}} \{[2^i + 1]^j\} \;\cup\; \bigcup_{j=0}^{2^{n-i}-1} \bigcup_{k=1}^{2^i} \{[2^i + 1]^j[k]\} \]

and

\[ \lambda_{t_i}(v) = \begin{cases} A_i \in \mathcal{N}_{2^i+1} & \text{if } v = [2^i + 1]^j \text{ with } 0 \le j \le 2^{n-i} - 1, \\ l(j \cdot 2^i + k - 1) & \text{if } v = [2^i + 1]^j[k] \text{ with } 0 \le j \le 2^{n-i} - 1 \text{ and } 1 \le k \le 2^i, \\ l(2^n) & \text{if } v = [2^i + 1]^{2^{n-i}}. \end{cases} \]

Example 14. Figs. 17.1, 17.2, 17.3 and 17.4 show the right-hand sides of the start productions of the grammars $\mathcal{G}_0^\infty$, $\mathcal{G}_1^\infty$, $\mathcal{G}_2^\infty$ and $\mathcal{G}_3^\infty$ generated by a run of our algorithm on the tree $s_4$.

In order to argue that we have $\max(t_i) = (A_i, 2^i + 1, A_i) =: \alpha_i$ for every $0 < i < n$, we investigate the number of occurrences of all digrams occurring in the right-hand side of $\mathcal{G}_i^\infty$'s start production. Firstly, it is easy to verify that $|\mathrm{occ}_{t_i}(\alpha_i)| = 2^{n-i-1}$. In contrast, for every $1 \le k \le 2^i$ and $x \in \{a, b, c, d, e\}$ the inequality $|\mathrm{occ}_{t_i}((A_i, k, x))| \le \lceil 2^{n-i} / 5 \rceil$ holds. This is because no power of 2 is divisible by 5, i.e., for every $1 \le k \le 2^i$ and every $0 \le j \le 2^{n-i} - 5$ we have

\[ \lambda_{t_i}([2^i + 1]^j[k]) \ne \lambda_{t_i}([2^i + 1]^{j+1}[k]) \ne \lambda_{t_i}([2^i + 1]^{j+2}[k]) \ne \lambda_{t_i}([2^i + 1]^{j+3}[k]) \ne \lambda_{t_i}([2^i + 1]^{j+4}[k]). \]

Due to the fact that we do not replace digrams with child symbols $a$, $b$, $c$, $d$ or $e$, the right-hand side of $\mathcal{G}_{n-1}^\infty$'s start production has to contain at least $2^n$ nodes labeled by these symbols, i.e., we can conclude that $|\mathcal{G}_{n-1}^\infty| \ge 2^n$. Therefore the compression ratio cannot be better than 50%.

In contrast, a run $\mathcal{G}_0^1, \mathcal{G}_1^1, \dots, \mathcal{G}_k^1$ of our algorithm on the tree $s_n$ leads to a significantly better compression ratio when restricting the maximal rank of a nonterminal to 1, where $k \in \mathbb{N}_{>0}$, $\mathcal{G}_i^1 = (N_i, P_i, S_i)$, $(S_i \to t_i)$ is the start production of $\mathcal{G}_i^1$ and $i \in \{0, 1, \dots, k\}$. In the first iteration we have $\max_1(t_0) \ne (f, 2, f)$, since a replacement of $(f, 2, f)$ would result in a nonterminal with a rank greater than 1. Therefore only the digrams $(f, 1, a)$, $(f, 1, b)$, $(f, 1, c)$, $(f, 1, d)$, $(f, 1, e)$ and subsequent digrams can be replaced. It turns out that after the first nine iterations the pattern $f(a, f(b, f(c, f(d, f(e, \dots)))))$ is represented by a new nonterminal $A_9$ with $\mathrm{rank}(A_9) = 1$. The actual order of the replacements within the first nine iterations depends on the method used to choose a most frequent digram when there are multiple most frequent digrams. Refer to Example 15 for one possible order.

The right-hand side of $\mathcal{G}_9^1$'s start production is a degenerate tree mainly consisting of consecutive nonterminals $A_9$. The corresponding nodes (there are roughly $2^n/5$ of them) are then boiled down using approximately $\log_2(2^n/5)$ digram replacements. Therefore the total number of edges of the resulting grammar is in $O(n)$, i.e., it is of logarithmic size (the size of the input tree $s_n$ is $2^{n+1} + 1$). Thus, we were able to construct a set of trees which exhibit a better compressibility when restricting the maximal rank of a nonterminal to 1.

Example 15. Let us consider a run of Re-pair for Trees on the tree $s_4 \in U$ when restricting the maximal rank of a nonterminal to 1 (see Fig. 18.1 for a depiction of $s_4$). Table 4 shows one of several possible orders of digram replacements and Fig. 18 shows how the right-hand sides of the start productions evolve.

Fig. 17.1: The tree $s_4 \in U$ which is the right-hand side of $\mathcal{G}_0^\infty$'s start production.

Fig. 17.2: The right-hand side of $\mathcal{G}_1^\infty$'s start production.

Fig. 17.3: The right-hand side of $\mathcal{G}_2^\infty$'s start production.

Fig. 17.4: The right-hand side of $\mathcal{G}_3^\infty$'s start production.

Fig. 18: The right-hand sides for the nonterminals $S_0, \dots$

    Iteration    Replaced digram      New nonterminal    cf. Figure
    1            $(f, 1, a)$          $A_1$              18.2
    2            $(f, 1, b)$          $A_2$              18.3
    3            $(A_1, 1, A_2)$      $A_3$              18.4
    4            $(f, 1, c)$          $A_4$              18.5
    5            $(f, 1, d)$          $A_5$              18.6
    6            $(A_3, 1, A_4)$      $A_6$              18.7
    7            $(A_6, 1, A_5)$      $A_7$              18.8
    8            $(f, 1, e)$          $A_8$              18.9
    9            $(A_7, 1, A_8)$      $A_9$              18.10

Table 4: A run of Re-pair for Trees on the tree $s_4 \in U$ with a maximal nonterminal rank of 1.



4 Implementation Details 

We implemented a prototype of the Re-pair for Trees algorithm, named TreeRePair, running on 
XML documents. In the sequel, we demonstrate that it produces for any XML document tree in 
(9(|t|) time a linear A: -bounded SLCF tree grammar Q, where /c G N is a constant, val(^) = t and 
t E T{F) is the binary representation of the input tree. 

There are several reasons to restrict the maximal rank to a constant $k$. One of them is that only 
this way are we able to obtain a linear-time implementation. Another reason is that for every 
$k$-bounded linear SLCF tree grammar $\mathcal{G}$ generated by TreeRePair it can be checked in polynomial 
time if a given tree automaton accepts $\mathrm{val}(\mathcal{G})$ (using a result from [LM06]). Last but not least, 
Sect. 3.7 on page 24 showed us that for flat XML documents leading to a right-leaning binary 
tree it is quite promising to restrict the maximal rank. The latter reason is also supported by our 
experiments with different maximal ranks on our test set of XML documents. 

On average, a maximal rank of 4 leads to the best compression performance (cf. Sect. 6.7 
on page 64). Due to this fact TreeRePair generates 4-bounded linear SLCF tree grammars by 
default. This can be adjusted by using the -max_rank switch. 

4.1 Reading the Input Tree 

The XML document tree of the input file can be directly transformed into a binary $\mathcal{F}$-labeled 
tree $t = (\mathrm{dom}_t, \lambda_t) \in T(\mathcal{F})$.⁶ The XML document is parsed by a SAX-like parser calling the 
functions start-element and end-element (see Figs. 20 and 21) of an object taking care 
of the tree construction. The latter is called tree constructor in the sequel. 

The tree constructor uses three stacks to properly encode the SAX events. Firstly, the stack 
index_stack keeps track of the index⁷ of the current element read. The stack name_stack 
stores the element types of the elements in order to be able to update the labeling function $\lambda_t$ 
within the end-element function. Together with the stack hierarchy_stack, which is 
used to maintain the current sequence of parents within $t$, this provides enough information to 
encode the SAX events. 



⁶ Refer to Sect. 2.4 on page 7 for an explanation of the binary tree model. 

⁷ Analogously to our definition for ranked trees: If an element is the n-th child of its parent element, then the index of this 
element is n. 

 1 FUNCTION start-element(name)
 2   if (hierarchy_stack is not empty) then
 3     i := index_stack.top() + 1;
 4     index_stack.pop();
 5     index_stack.push(i);
 6
 7     v := hierarchy_stack.top();
 8
 9     if (i = 1) then u := v1
10     else u := v2
11     endif
12
13     name_stack.push(name);
14   else
15     u := ε;
16     λ_t(ε) := name^{10};
17   endif
18
19   dom_t := dom_t ∪ {u};
20
21   index_stack.push(0);
22   hierarchy_stack.push(u);
23 ENDFUNC



Fig. 20: The start-element function which is called for every start-tag. 

To be more precise, if the parser encounters a start-tag, it extracts the element type of the 
element and passes it to the tree constructor by calling the function start-element. If it is 
the first call of start-element, we must be dealing with the root of the document. Thus, the 
stack hierarchy_stack is empty and the else-part beginning in line 15 is processed. First 
of all, the variable u is identified with $\varepsilon$ (and later added to the set $\mathrm{dom}_t$). Afterwards, the labeling 
function $\lambda_t$ is updated accordingly. Since, in the binary tree model, the root has no sibling nodes 
and since it is assumed that the input tree consists of at least two nodes, it is clear that the terminal 
symbol labeling the root node will have a left child but no right child (therefore the superscript 
10 in line 16). 

If we consider a subsequent call of start-element, the hierarchy stack is not empty and 
therefore the if-part is processed. Firstly, the index stack is updated in lines 3-5 and after 
that the node $v \in \mathrm{dom}_t$ is retrieved from the hierarchy stack (line 7). The tree node v will be 
the parent of the node which is added in the following. We introduce a new node u which is 
later (but still in the same call of this function) added to $\mathrm{dom}_t$ (line 19). The node u becomes the 
left child of v if it represents the first child element of the element which is represented by v. 
In contrast, u becomes a right child if the current index i is greater than one, i.e., if the element 
being processed is a sibling element of the element represented by v. Regarding the node u, 
we are unable to update the labeling function $\lambda_t$ at this time since we do not know if the XML 
element being processed has children or sibling elements. 

If an end-tag is encountered by the input parser, the function end-element listed in Fig. 21 
is called. Now, the index of the current XML element is consulted in order to bubble up the 
sequence of parents stored by the hierarchy stack the correct number of times. Lastly, after 
processing the repeat loop, the node representing the first child element of the current XML element 
(the end-tag of its last child element was just read) is on top of the hierarchy stack. For every 
node $v \in \mathrm{dom}_t$ which is removed from the hierarchy stack within the repeat loop the labeling 
function $\lambda_t$ is updated. 

FUNCTION end-element
  i := index_stack.top();
  repeat i times
    v := hierarchy_stack.top();
    name := name_stack.top();

    l := 0; r := 0;
    if (v1 ∈ dom_t) then
      l := 1;
    endif
    if (v2 ∈ dom_t) then
      r := 1;
    endif

    λ_t(v) := name^{lr};

    hierarchy_stack.pop();
    name_stack.pop();
  endrepeat
  index_stack.pop();
ENDFUNC



Fig. 21: The end-element function which is called for every end-tag encountered in the input XML document. 

Example 16. Fig. 22 shows the evolution of the data structures after the first calls to the functions 
start-element() and end-element(), respectively, when parsing the input tree from 
Fig. 4. It shows the content of the three stacks after the body of the corresponding function has 
been executed, where is denotes the index stack, hs denotes the hierarchy stack and ns denotes 
the name stack. Regarding Fig. 22, the element on top of the stack is always the upper element 
in the depiction of the corresponding stack. If no label has been assigned to a node yet, i.e., 
the labeling function $\lambda_t$ has not been updated accordingly, the node is depicted in brackets. 

The binary representation of the input tree can be obtained in linear runtime since the function 
start-element and the function end-element, respectively, are each called only once for 
every node of the input tree. Furthermore, the body of the repeat loop of the latter function is 
executed once for every input node (except for the root node). 
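
The two functions from Figs. 20 and 21 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the TreeRePair source: the class name `TreeConstructor`, the encoding of node addresses as strings over {1, 2} (with the empty string standing for $\varepsilon$), and the representation of a label name^{lr} as a tuple `(name, l, r)` are all choices made for this sketch.

```python
class TreeConstructor:
    """Builds the binary tree from SAX-like events using three stacks."""

    def __init__(self):
        self.index_stack = []      # index of the last child read, per open element
        self.name_stack = []       # element types of nodes whose label is pending
        self.hierarchy_stack = []  # current sequence of parents (node addresses)
        self.dom = set()           # node addresses; "" plays the role of epsilon
        self.label = {}            # labeling function: address -> (name, l, r)

    def start_element(self, name):
        if self.hierarchy_stack:
            i = self.index_stack.pop() + 1
            self.index_stack.append(i)
            v = self.hierarchy_stack[-1]
            # first child becomes the left child, later siblings a right child
            u = v + "1" if i == 1 else v + "2"
            self.name_stack.append(name)
        else:
            u = ""
            self.label[u] = (name, 1, 0)  # root: left child, no right child
        self.dom.add(u)
        self.index_stack.append(0)
        self.hierarchy_stack.append(u)

    def end_element(self):
        i = self.index_stack[-1]
        for _ in range(i):
            v = self.hierarchy_stack.pop()
            name = self.name_stack.pop()
            l = 1 if v + "1" in self.dom else 0
            r = 1 if v + "2" in self.dom else 0
            self.label[v] = (name, l, r)
        self.index_stack.pop()


# <books><book><title/></book><book/></books>; None stands for an end-tag
tc = TreeConstructor()
for event in ["books", "book", "title", None, None, "book", None, None]:
    if event is None:
        tc.end_element()
    else:
        tc.start_element(event)
```

Every event triggers a constant number of stack operations plus one loop iteration per closed child, which mirrors the linear-time argument above.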

Re-pair for Trees on Multiary Trees 
Another way of modeling an XML document tree in a 
ranked way is the multiary tree model. In contrast to the binary tree model (which we described 
in Sect. 2.4 on page 7), this model does not encode the input tree by a binary tree. Instead, it turns the 
input tree into a ranked tree by introducing a terminal symbol for each combination of element type and 
number of children which occurs in the input tree. Let us assume that an element type occurs 
three times and that the corresponding elements have three different numbers of children. 
In the multiary tree model, three different terminal symbols are then introduced for this element type. 

During our investigations we also evaluated a TreeRePair version based on the multiary tree 
model. However, this modified version of our algorithm was outperformed by the original 
version in terms of compression ratio. This is due to the nature of typical XML documents. XML 
elements encountered in real-world XML documents often exhibit a long list of child 
elements. Therefore, compared to the binary tree model, a multiary tree model representation of 
an XML document leads to a higher number of different digrams, each occurring less often. This, in 
turn, prevents TreeRePair from compressing the XML document tree to the same degree as 
in the binary case. 

Fig. 22: Content of the stacks after each call of the functions start-element() and end-element(), respectively, when 
parsing the tree from Fig. 4. In addition, each step depicts the binary tree constructed so far. Panel titles: 
(1) start-element(books), (4) end-element(), (5) start-element(title), (6) end-element(), 
(10) start-element(book), (11) start-element(book). 

Example 17. Consider for example the XML document tree from Fig. 4. The element of type 
books has five child elements of type book, i.e., each of the five digrams 

(books, 1, book), (books, 2, book), ..., (books, 5, book) 

occurs only once. None of these digrams is replaced by TreeRePair since a replacement is only 
reasonable if the corresponding digram occurs at least twice. In contrast, the binary tree model 
leads to two occurrences of the digram (book, 1, book) which can be replaced by a new 
nonterminal symbol in a run of TreeRePair (cf. Fig. 5). 

4.2 Representing the Input Tree in Memory 

In this section we show that the ranked input tree of our algorithm can be efficiently stored as 
a DAG in memory. This DAG representation can be made nearly transparent to the rest of the 
algorithm (cf. Sect. 4.5 on page 41).⁸ Thus, by default, the tree constructor of our prototype does 
not only directly transform the XML document tree into a ranked representation but also infers 
the corresponding minimal 0-bounded SLCF tree grammar $\mathcal{G} = (N, P, S)$, i.e., the minimal 
DAG, of the latter on the fly. 

In [BGK03] it has been demonstrated that the representation of XML document trees based 
on the concept of sharing subtrees is highly efficient. Their experiments have shown that in 
several cases the size of the DAG was less than 10% of the uncompressed XML document tree. 
Therefore, the sharing of common subtrees enables us to load large XML document trees which 
would otherwise have exceeded the available computational resources. In addition, it avoids 
time-consuming swapping and the repeated re-computation of the same results concerning subtrees 
that are shared. 

Now, let us elaborate on how one can infer the DAG of the ranked representation 
$t = (\mathrm{dom}_t, \lambda_t) \in T(\mathcal{F})$ of the XML document tree. The tree constructor must check for every node 
which is removed from the hierarchy stack in the end-element function if the subtree rooted 
at this node can be shared. This can be accomplished by calling the function share-subtree 
listed in Fig. 23. To better understand this function, let us assume that we want to check if the 
subtree $t' \in T(\mathcal{F})$ rooted at a node $v \in \mathrm{dom}_t$ can be shared. If we already encountered an exact 
copy of $t'$ while reading the input tree, all subtrees of $t'$ must have been shared before. Thus, the 
tree $t'$ must be of depth 1 and all children nodes must be labeled by nonterminals of the DAG 
grammar $\mathcal{G}$. Therefore, it is only necessary to compare the labels of the root of $t'$ and its direct 
children with those of all subtrees encountered until now. This can be done in constant time with 
the help of a hash table. 

Now, let us assume that we have processed an exact copy of $t'$ earlier, i.e., $t'$ can be shared. 
Thus, the condition in line 3 evaluates to true and the hash table subtrees_ht contains 
$t'$. Hence, the else-part beginning in line 6 is processed. If there already exists a nonterminal 
$B \in N$ with right-hand side $t'$ then we set $A := B$. We can check this in $O(1)$ time because 
with each entry of the hash table subtrees_ht we can store a pointer to the corresponding 
production. Otherwise, i.e., if there exists no $(B \to t'') \in P$ with $t' = t''$, we introduce a new 
nonterminal $A \in \mathcal{N}_0 \setminus N$ with right-hand side $t'$ and replace the first occurrence u of the subtree 
$t'$ by A. There can be only one earlier occurrence of the subtree $t'$ since otherwise we would 
already have inserted a corresponding production. Furthermore, we can guarantee constant time 
access to u because with each entry in the hash table subtrees_ht we can store a pointer to 
the corresponding first occurrence. Finally, we add the subtree rooted at the node parent(u) to 
the hash table if all of its subtrees are shared. We do not need to insert the subtree rooted at the 
node parent(v) since we will process parent(v) in a later step (since we are traversing the input 
tree in postorder). In contrast, if $t'$ was not encountered until now, we add it to the hash table 
subtrees_ht (line 5) in order to be able to share possible later occurrences of it. 

⁸ Note that the DAG representation can also be circumvented by using the -no_dag switch. In this case the whole binary tree 
with all its possible redundancy is constructed in main memory. 

 1 FUNCTION share-subtree(v)
 2   let t' be the subtree rooted at v;
 3   if (∀ 1 ≤ i ≤ rank(λ_t(v)) : λ_t(vi) ∈ 𝒩₀) then
 4     if (subtrees_ht does not contain t') then
 5       insert t' into subtrees_ht;
 6     else
 7       if (∃ B ∈ 𝒩₀ : (B → t') ∈ P) then
 8         A := B;
 9       else
10         choose nonterminal A ∈ 𝒩₀ \ N;
11         N := N ∪ {A}; P := P ∪ {(A → t')};
12         let u be the node at which the first
13           occurrence of t' is rooted;
14         replace subtree rooted at u by A;
15
16         w := parent(u);
17         if (∀ 1 ≤ i ≤ rank(λ_t(w)) : λ_t(wi) ∈ 𝒩₀) then
18           let t'' be the subtree rooted at w;
19           insert t'' into subtrees_ht;
20         endif
21       endif
22
23       replace subtree rooted at v by A;
24     endif
25   endif
26 ENDFUNC

Fig. 23: The function share-subtree which checks for the subtree rooted at the node $v \in \mathrm{dom}_t$ if it can be shared. If this is 
the case then the sharing is performed. 

Initially, i.e., after reading the input tree, all shared subtrees are of depth 1. In order to reduce 
the number of nonterminals of the DAG grammar (without increasing the number of total edges) 
all productions referenced only once are eliminated. All in all, inferring the DAG grammar 
needs linear time and can be conveniently combined with the step of transforming the input tree 
into a ranked tree. 
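
The idea of sharing subtrees through a hash table can be illustrated with a small Python sketch. Note that this is plain bottom-up hash-consing, a simplification of share-subtree from Fig. 23: it assigns an identifier to every distinct subtree right away instead of introducing a nonterminal only at the second occurrence. The function name `build_dag` and the nested-tuple encoding of trees are assumptions of this sketch.

```python
def build_dag(tree):
    """tree: nested tuples (label, child, child, ...).

    Returns (root_id, nodes), where nodes[i] = (label, child_id, ...) is the
    i-th distinct subtree and every child is referenced by its identifier.
    """
    ids = {}    # hash table: (label, child_id, ...) -> identifier
    nodes = []  # identifier -> (label, child_id, ...)

    def visit(node):
        label = node[0]
        # postorder: all children are shared before their parent is looked up,
        # so comparing the root label and the child identifiers is sufficient
        key = (label,) + tuple(visit(child) for child in node[1:])
        if key not in ids:
            ids[key] = len(nodes)
            nodes.append(key)
        return ids[key]

    return visit(tree), nodes


# a tree with 11 nodes: f(f(a, f(a, a)), f(a, f(a, a)))
a = ("a",)
inner = ("f", a, ("f", a, a))
root_id, nodes = build_dag(("f", inner, inner))
```

For this input the 11 tree nodes collapse into four distinct subtrees. The recursive traversal is fine for small examples; a deep XML tree would call for an explicit stack, as in the tree constructor.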

4.3 Utilized Data Structures 

The data structures we use in our implementation are similar to those used in [LM00]. In order 
to be able to focus on the essentials, we do not pay attention to the fact that, internally, the input 
tree is represented by a DAG. 

Let us assume that the binary input tree $t = (\mathrm{dom}_t, \lambda_t) \in T(\mathcal{F})$ has been generated by our 
implementation after reading a corresponding XML document tree. Hence, the tree t is the ranked 
representation of the latter. In main memory, every node $v \in \mathrm{dom}_t$ is represented by an object 
exhibiting several pointers. These allow constant time access to the parent and all children of the 
node v and to the possible next and previous occurrences of the digram $\alpha = (\lambda_t(v), i, \lambda_t(vi))$, 
where $i \in \{1, 2, \ldots, \mathrm{rank}(\lambda_t(v))\}$. The pointers to the next and previous occurrences of $\alpha$ form 
a doubly linked list of all the occurrences in $\mathrm{occ}_t(\alpha)$. We call this type of list an occurrences list 
(of $\alpha$) in the sequel.⁹ The specific order of the occurrences in an occurrences list is not relevant. 

          f
        /   \
       f     f
      / \   / \
     a   f a   f
        / \   / \
       a   a a   a

Fig. 24: The tree $t \in T(\mathcal{F})$ modeled by the node objects from Fig. 25. 

Every digram is represented by a special object. It exhibits two pointers which reference the 
first and the last element of the corresponding occurrences list. Let us consider a digram $\alpha \in \Pi$ 
with $|\mathrm{occ}_t(\alpha)| = m$, where $m < \lfloor\sqrt{n}\rfloor$ and $n = |t|$. Then the corresponding object exhibits two 
more pointers which point to the next and previous digram $\beta \in \Pi$ with $|\mathrm{occ}_t(\beta)| = m$, respectively. 
These pointers form a doubly linked list of all digrams occurring m times. We call this type 
of list the m-th digram list. In contrast, all digrams $\gamma \in \Pi$ with $|\mathrm{occ}_t(\gamma)| \ge \lfloor\sqrt{n}\rfloor$ are organized 
in one doubly linked list which is called the top digram list. 

These doubly linked lists of digrams are in turn referenced by a digram priority queue. This 
queue consists of $\lfloor\sqrt{n}\rfloor$ entries. The i-th entry stores a pointer to the head of the i-th digram 
list, where $1 \le i < \lfloor\sqrt{n}\rfloor$. The $\lfloor\sqrt{n}\rfloor$-th entry references the head of the top digram list. Refer 
to Sect. 4.4 on page 36 for an explanation of why we designed the digram lists and priority 
queue as described above. Lastly, there is a digram hash table storing pointers to all occurring 
digrams. It allows constant time access to all digrams and therefore constant time access to the 
first occurrence of each digram. 
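
A rough Python sketch of such a bucketed priority queue is given below. It is not the data structure of the implementation: Python sets stand in for the doubly linked digram lists, a dictionary stands in for the digram hash table, and the class name `DigramQueue` and its methods are hypothetical. The sketch also rescans the lower lists on each query instead of remembering the first non-empty one.

```python
import math


class DigramQueue:
    """Digrams bucketed by occurrence count, with one shared top list."""

    def __init__(self, n):
        # entry `top` holds every digram occurring at least floor(sqrt(n)) times
        self.top = max(1, math.isqrt(n))
        self.lists = {m: set() for m in range(1, self.top + 1)}
        self.count = {}  # digram -> number of non-overlapping occurrences

    def _bucket(self, count):
        return min(count, self.top)

    def update(self, digram, count):
        """Move a digram to the list matching its new occurrence count."""
        old = self.count.pop(digram, 0)
        if old:
            self.lists[self._bucket(old)].discard(digram)
        if count > 0:
            self.count[digram] = count
            self.lists[self._bucket(count)].add(digram)

    def most_frequent(self):
        if self.lists[self.top]:
            # the top list is unordered, so it has to be scanned completely
            return max(self.lists[self.top], key=self.count.__getitem__)
        for m in range(self.top - 1, 0, -1):
            if self.lists[m]:
                # all digrams in the m-th list occur exactly m times
                return next(iter(self.lists[m]))
        return None


queue = DigramQueue(11)  # a tree of size 11 yields three entries
queue.update(("f", 1, "f"), 1)
queue.update(("f", 2, "a"), 2)
queue.update(("f", 2, "f"), 2)
queue.update(("f", 1, "a"), 4)
```

With these counts, the first two lists and the top list are populated in the same way as described for the tree from Fig. 24 in Example 18 below the figure discussion.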

Let us consider the following example to see how the utilized data structures work. 

Example 18. Let us assume that the tree $t = (\mathrm{dom}_t, \lambda_t) \in T(\mathcal{F})$ shown in Fig. 24 has been 
generated by our implementation after reading a corresponding XML document tree. Then Fig. 25 
shows a simplified depiction of the data structures used to efficiently replace the digrams in the 
replacement step. All non-null pointers are represented by arrows starting in a filled circle and 
ending in an empty circle. A filled circle without an outgoing arrow denotes a null pointer. 

With respect to Fig. 25, there is a total of 11 node objects representing tree nodes labeled by 
the two symbols $f \in \mathcal{F}_2$ and $a \in \mathcal{F}_0$. An instance of a tree node $v \in \mathrm{dom}_t$ is represented by a 
tabular box as it is shown in Fig. 25.1. Unlike depicted, in our implementation a symbol is not 
directly stored within the node structure but for every unique symbol there is an object which 
is referenced by the corresponding nodes. The upper left empty circle of the box represents the 
memory address of the tree node instance. Thus, every arrow representing a pointer to the latter 
will end in this empty circle. The filled circle in the first row of the tabular box represents the 
pointer to the possible parent node parent(v). The pointer to the i-th child vi of the node v is 
depicted by an arrow starting at the filled circle in the i-th column of the children row, where 
$i \in \{1, 2, \ldots, \mathrm{rank}(\lambda_t(v))\}$. Analogously, a pointer to a possible next (previous) occurrence of 
the digram $\alpha = (\lambda_t(v), i, \lambda_t(vi))$ is represented by a filled circle in the i-th column of the row 
labeled by next (previous, respectively), where $i \in \{1, 2, \ldots, \mathrm{rank}(\lambda_t(v))\}$. 

⁹ During our investigations we also implemented a TreeRePair version avoiding these doubly linked lists of occurrences. 
Instead, for every digram, we used a hashed set storing pointers to all occurrences. However, this version had no benefits 
compared to the doubly linked list approach but led to slightly longer runtimes. Considering the memory usage, in some 
cases it achieved better results while in others a substantial increase was noticed. 

Fig. 25.1: A graphical representation of an object representing a tree node labeled by $f \in \mathcal{F}_2$. 

Fig. 25.2: A graphical representation of a digram $(f, 1, a) \in \Pi$. 

Each of the digrams (f, 1, f), (f, 2, a), (f, 2, f) and (f, 1, a) is represented by a tabular box (see 
Fig. 25.2). Again, unlike depicted, in our implementation a symbol is not directly stored within 
the digram structure but the latter contains two pointers to the objects representing the two symbols 
a and b of a digram (a, i, b). The first and the last element of the occurrences list of the digram 
$\alpha$ are referenced by the first and last pointers of the object representing the digram $\alpha$. 
The pointers prev (previous) and next are part of the $|\mathrm{occ}_t(\alpha)|$-th digram list if 
$|\mathrm{occ}_t(\alpha)| < \lfloor\sqrt{n}\rfloor$ and $n = |t|$. Otherwise they belong to the top digram list. 

The digram (f, 1, f) forms a trivial doubly linked list, namely, the 1st digram list. The latter is 
referenced by the entry 1 of the priority queue. The digram (f, 1, a) forms the (trivial) top digram 
list which is referenced by the entry 3 of the priority queue. In contrast, the digrams (f, 2, a) and 
(f, 2, f) each occur twice and therefore point to each other with their next and previous 
pointers, respectively. The first element of the resulting 2nd digram list is referenced by the entry 
2 of the priority queue. The digram hash table stores the pointers to all four occurring digrams. 

4.4 Complexity of the TreeRePair Algorithm 

Theorem 1. For any given input tree with n edges, TreeRePair produces in time $O(|t|)$ a 
$k$-bounded linear SLCF tree grammar $\mathcal{G}$, where $k \in \mathbb{N}$ is a constant, $t \in T(\mathcal{F})$ is the binary 
representation of the input tree, and $\mathrm{val}(\mathcal{G}) = t$. 

It is straightforward to come up with a linear time implementation of the pruning step of the 
Re-pair for Trees algorithm (cf. Sect. 3.3 on page 13). Therefore, we just want to investigate the 
complexity of the replacement step which was described in Sect. 3.2 on page 13. 

With every replacement of a digram occurrence one edge of the input tree is absorbed. 
Therefore, a run of TreeRePair can consist of at most $n - 1$ iterations, where n is the size of the input 
tree. Each replacement of an occurrence can be accomplished in $O(1)$ time since at most k 
children need to be reassigned; in our implementation, the reassignment of a child node is just a 
matter of updating two pointers.¹⁰ For every production which is introduced during a run of our 
algorithm it holds that the right-hand side t is of size $|t| \le 2 + k$, i.e., it can be constructed in 
constant time. 

Fig. 25: A simplified depiction of a part of the data structures used by our implementation. 

FUNCTION retrieve-all-occs(t)
  v := ε;
  while (true) do
    v := next_in_postorder(t, v);
    if (v ≠ ε) then
      α := (λ_t(parent(v)), index(v), λ_t(v));
      if (v ∉ occ_t(α)) then
        occ_t(α) := occ_t(α) ∪ {parent(v)};
      endif
    else
      return;
    endif
  endwhile
ENDFUNC

Fig. 26: The function retrieve-all-occs which is used to construct the set $\mathrm{occ}_t(\alpha)$ for every digram $\alpha \in \Pi$ occurring in 
the tree $t \in T(\mathcal{F} \cup \mathcal{N})$. It uses the function next-in-postorder listed in Fig. 7. 

However, to show that the replacement step can be performed in linear time two more aspects 
need to be considered. Imagine that we are in the i-th iteration of our algorithm (and $\mathcal{G}_{i-1}$ is the 
current grammar). Let $t \in T(\mathcal{F} \cup \mathcal{N})$ be the right-hand side of $\mathcal{G}_{i-1}$'s start production. 

(1) Updating the sets of non-overlapping occurrences 

In every iteration of our algorithm we need to know the number of occurrences of each 
digram. Only then are we able to determine the most frequent digram. In addition, for 
replacing the digram $\max_k(t)$, we need to know $\mathrm{occ}_t(\max_k(t))$. How can we compute the set 
$\mathrm{occ}_t(\alpha)$ for every digram $\alpha \in \Pi$ without traversing the whole right-hand side of the current 
start production in each iteration? 

(2) Retrieving the most frequent digram 

Let us assume that there is an up to date set $\mathrm{occ}_t(\alpha)$ available for every $\alpha \in \Pi$ occurring 
in t (in the form of occurrences lists). How do we determine the most frequent digram in 
constant time? 

In the following we consider each of the above aspects in detail. 

Updating the Sets of Non-overlapping Occurrences 
Let the binary tree $t = (\mathrm{dom}_t, \lambda_t) \in T(\mathcal{F})$ be our input tree. At the beginning of the 
replacement step the set $\mathrm{occ}_t(\alpha)$ for every digram $\alpha \in \Pi$ occurring in t is initially constructed. 
This is done by traversing the tree t in a similar way as 
it is done in the function retrieve-occurrences which is listed in Fig. 8. However, during 
the traversal not only one digram is considered but for every encountered digram $\alpha \in \Pi$ the set 
$\mathrm{occ}_t(\alpha)$ is constructed. Fig. 26 shows a possible function which accomplishes this task. 

Therefore, in the first iteration of our computation we have up to date sets of non-overlapping 
occurrences at hand. However, we cannot afford to redo this traversal in every subsequent 
iteration, since we would then not be able to achieve a linear runtime of our algorithm. 
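
The initial traversal from Fig. 26 can be sketched in Python as shown below. The nested-tuple tree encoding, the use of integer tuples as node addresses (the empty tuple standing for $\varepsilon$), and the recursive formulation in place of next_in_postorder are assumptions of this sketch. The input is the tree from Fig. 24.

```python
def retrieve_all_occs(tree):
    """tree: nested tuples (label, child, ...).

    Returns a dict mapping each digram (parent label, child index, child label)
    to the set of addresses of its non-overlapping occurrences.
    """
    occ = {}

    def visit(node, addr, parent_label):
        label = node[0]
        for i, child in enumerate(node[1:], start=1):
            visit(child, addr + (i,), label)  # postorder: children first
        if addr:  # every node except the root closes an edge to its parent
            digram = (parent_label, addr[-1], label)
            found = occ.setdefault(digram, set())
            # the edge to the parent overlaps with an occurrence rooted at this
            # node only if this node was already recorded for the same digram
            if addr not in found:
                found.add(addr[:-1])

    visit(tree, (), None)
    return occ


a = ("a",)
inner = ("f", a, ("f", a, a))
occ = retrieve_all_occs(("f", inner, inner))
counts = {digram: len(nodes) for digram, nodes in occ.items()}
```

The resulting counts agree with Example 18: (f, 1, a) occurs four times, (f, 2, a) and (f, 2, f) twice each, and (f, 1, f) once. The root edge of (f, 2, f) is skipped because it overlaps with the occurrence at node 2.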



¹⁰ As already mentioned at the beginning of this section on page 29: The maximal rank of a nonterminal of a grammar generated 
by TreeRePair is $k \in \mathbb{N}$. The constant k can be specified by a command line switch. 

Fig. 27: The tree $t' \in T(\mathcal{F})$. All occurrences which would be absorbed by the replacement are highlighted. 

FUNCTION remove-absorbed-occs(t, v, j)
  if (v ≠ ε) then
    α := (λ_t(parent(v)), index(v), λ_t(v));
    occ'_t(α) := occ'_t(α) \ {parent(v)};
  endif

  for (l ∈ {1, 2, ..., rank(λ_t(v))}) do
    α := (λ_t(v), l, λ_t(vl));
    occ'_t(α) := occ'_t(α) \ {v};
  endfor

  for (l ∈ {1, 2, ..., rank(λ_t(vj))}) do
    α := (λ_t(vj), l, λ_t(vjl));
    occ'_t(α) := occ'_t(α) \ {vj};
  endfor
ENDFUNC

Fig. 28: Listing of the function remove-absorbed-occs which removes all absorbed occurrences from the $\mathrm{occ}'_t$ sets. 

Fortunately, there is another way of keeping track of the sets of non-overlapping occurrences. 
It relies on the fact that every replacement of a digram occurrence v only involves those 
occurrences in the neighborhood of v which overlap with v. 

Example 19. Let us consider the tree $t' = (\mathrm{dom}_{t'}, \lambda_{t'}) \in T(\mathcal{F})$ which is depicted in Fig. 27. The 
occurrences which would be absorbed by the replacement of the occurrence $2 \in \mathrm{dom}_{t'}$ of the 
digram (f, 1, g) are highlighted. 

For every digram $\alpha \in \Pi$ we set $\mathrm{occ}'_t(\alpha) := \mathrm{occ}_t(\alpha)$ and base all upcoming computations on the 
sets $\mathrm{occ}'_t(\alpha)$. In particular we use them to determine the most frequent digram in each iteration. 

Let us consider the i-th iteration of a run $\mathcal{G}_0, \mathcal{G}_1, \ldots, \mathcal{G}_h$ of Re-pair for Trees on the input tree 
$t \in T(\mathcal{F})$, where $h \in \mathbb{N}$ and $i \in \{1, 2, \ldots, h\}$. Then $\mathcal{G}_{i-1} = (N_{i-1}, P_{i-1}, S_{i-1})$ is the current 
grammar. Let $t_{i-1} \in T(\mathcal{F})$ be the right-hand side of $S_{i-1}$. Let us assume that an up to date set 
$\mathrm{occ}'_{t_{i-1}}(\beta)$ for every $\beta \in \Pi$ which is occurring in $t_{i-1}$ is at hand. Further, let us assume that 
$\max(t_{i-1}) = (a, j, b) =: \alpha$ and let $v \in \mathrm{occ}'_{t_{i-1}}(\alpha)$. 

Before the actual replacement of the occurrence v we make use of the function listed in 
Fig. 28. The function call remove-absorbed-occs($t_{i-1}$, v, j) removes all occurrences 
which will be absorbed by the upcoming replacement from the sets $\mathrm{occ}'_{t_{i-1}}$. After the 
replacement of v by a new node u with $\lambda_{t_{i-1}}(u) = A_i \in \mathcal{N}$ we call the function add-new-occs (which 
is listed in Fig. 29) and pass the tree $t_{i-1}$ and the node u. The function add-new-occs adds all 
new occurrences which arose by the introduction of u to the sets of non-overlapping occurrences. 
Finally, after all occurrences from $\mathrm{occ}'_{t_{i-1}}(\alpha)$ have been replaced, we set $\mathrm{occ}'_{t_i}(\beta) := \mathrm{occ}'_{t_{i-1}}(\beta)$ 
for all $\beta \in \Pi$ occurring in $t_i$. 

FUNCTION add-new-occs(t, u)
  if (u ≠ ε) then
    α := (λ_t(parent(u)), index(u), λ_t(u));
    occ'_t(α) := occ'_t(α) ∪ {parent(u)};
  endif

  for (l ∈ {1, 2, ..., rank(λ_t(u))}) do
    α := (λ_t(u), l, λ_t(ul));
    occ'_t(α) := occ'_t(α) ∪ {u};
  endfor
ENDFUNC

Fig. 29: Listing of the function add-new-occs which adds all newly created occurrences to the $\mathrm{occ}'_t$ sets. 

Fig. 30: Tree $t'' \in T(\mathcal{F})$ consisting of nodes labeled by the terminal symbols $a, b, c, d, f \in \mathcal{F}$. We have to deal with three 
overlapping occurrences of the digram (f, 2, f). 

Let $\alpha \in \Pi$ be a digram occurring in $t_i$. The above computed set $\mathrm{occ}'_{t_i}(\alpha)$ may not be equal 
to the actual set $\mathrm{occ}_{t_i}(\alpha)$ as it would be constructed by a complete postorder traversal of $t_i$ using 
the function retrieve-occurrences from Fig. 8. 

Example 20. Consider, for instance, the tree $t'' \in T(\mathcal{F})$ depicted in Fig. 30. Let $\alpha = (f, 2, f)$. 
In the first iteration of our algorithm, we would obtain $\mathrm{occ}'_{t''}(\alpha) := \mathrm{occ}_{t''}(\alpha) = \{2\}$. Now, let 
us assume that we replace the digram (f, 1, c) (we could easily enlarge $t''$ such that (f, 1, c) 
is the most frequent digram and still show the same). After performing this replacement and 
especially after calling the functions remove-absorbed-occs and add-new-occs we 
would have $\mathrm{occ}'_{t''}(\alpha) = \emptyset$. However, a postorder traversal of the updated tree $t''$ would result in 
$\mathrm{occ}_{t''}(\alpha) = \{\varepsilon\}$. 
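
The two recomputed values from Example 20 can be checked with a short Python sketch. The tree shape f(a, f(b, f(c, d))) used here is an assumption: it is one tree over the symbols a, b, c, d, f that reproduces the values stated in the example, and it need not coincide with the tree actually drawn in Fig. 30. The function name `occurrences`, the nested-tuple encoding and the label A for the new nonterminal are likewise hypothetical.

```python
def occurrences(tree, digram):
    """Non-overlapping occurrences of one digram, found by a postorder traversal."""
    found = set()

    def visit(node, addr, parent_label):
        label = node[0]
        for i, child in enumerate(node[1:], start=1):
            visit(child, addr + (i,), label)
        # record the parent unless this node is itself an occurrence (overlap)
        if addr and (parent_label, addr[-1], label) == digram and addr not in found:
            found.add(addr[:-1])

    visit(tree, (), None)
    return found


alpha = ("f", 2, "f")
# assumed shape of t'' before the replacement
before = ("f", ("a",), ("f", ("b",), ("f", ("c",), ("d",))))
# the occurrence of (f, 1, c) at node 22 replaced by a rank-1 nonterminal A
after = ("f", ("a",), ("f", ("b",), ("A", ("d",))))

occ_before = occurrences(before, alpha)
occ_after = occurrences(after, alpha)
```

Here `occ_before` contains only node 2, while `occ_after` contains only the root. The incrementally maintained set is empty after the replacement because remove-absorbed-occs deletes node 2 and add-new-occs only registers digrams that involve the new node.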

Updating the sets of non-overlapping occurrences takes constant time per occurrence 
replacement. At most $2k + 1$ occurrences need to be removed by the function remove-absorbed-occs 
and at most $k + 1$ occurrences need to be added by the function add-new-occs. An 
occurrence v of a digram $\alpha$ can be removed from the occurrences list of $\alpha$ in constant time by setting 
the next and previous pointers of the corresponding node object to null. In addition, if v 
is the first (last) occurrence in the occurrences list of $\alpha$ the first (last) pointer of the object 
representing the digram $\alpha$ needs to be updated. This can also be accomplished in constant time 
by using the digram hash table. Analogously, an occurrence can be added to an occurrences list 
in $O(1)$ time. 

Retrieving the Most Frequent Digram 
We now investigate the time needed to obtain the most 
frequent digram in an iteration of our algorithm. First of all, let us state the following fact: Let 
$m \in \mathbb{N} \cup \{\infty\}$ and let $\mathcal{G}_0, \mathcal{G}_1, \ldots, \mathcal{G}_n$ be a run of Re-pair for Trees, where $n \in \mathbb{N}_{>0}$, 
$\mathcal{G}_i = (N_i, P_i, S_i)$ and $(S_i \to t_i) \in P_i$ for every $i \in \{0, 1, \ldots, n\}$. Then 

$$|\mathrm{occ}_{t_i}(\max_m(t_i))| \ge |\mathrm{occ}_{t_{i+1}}(\max_m(t_{i+1}))|$$

holds for every $i \in \{0, 1, \ldots, n-1\}$.¹¹ For every digram $\alpha \in \Pi$ occurring in $t_i$ it holds that 
$|\mathrm{occ}_{t_i}(\alpha)| \ge |\mathrm{occ}_{t_{i+1}}(\alpha)|$ and for every digram $\beta \in \Pi$ which was introduced in $\mathcal{G}_{i+1}$ it holds that 
$|\mathrm{occ}_{t_{i+1}}(\beta)| \le |\mathrm{occ}_{t_i}(\max_m(t_i))|$, where $i \in \{0, 1, \ldots, n-1\}$. 

It is easy to see that, if the top digram list is empty, we can obtain the most frequent digram in 
constant time. We just need to walk down the remaining $\lfloor\sqrt{n}\rfloor - 1$ digram lists and choose the first 
element of the first non-empty list. In every iteration, after we have determined the most frequent 
digram, we remember the first non-empty digram list in order to save ourselves the needless and 
time-consuming rechecking of the empty digram lists. 

Now, let us assume that the top digram list, i.e., the doubly linked list of all digrams occurring 
at least $\lfloor\sqrt{n}\rfloor$ times, is not empty. We need to scan all elements in it since the digrams contained 
are not ordered by their frequency. There can be roughly at most $\sqrt{n}$ digrams in the top digram 
list. Therefore, we need roughly $O(\sqrt{n})$ time to retrieve the most frequent digram. However, by 
the replacement of this digram at least $\lfloor\sqrt{n}\rfloor$ edges are absorbed. It is easy to see that, all in all, 
obtaining the most frequent digram needs constant time on average. 
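
The averaging step can be spelled out as follows. This is only a rough sketch of the argument, ignoring rounding effects in the same way as the word "roughly" does above; the constant $c$ is introduced here for illustration.

```latex
% One scan of the top digram list inspects about \sqrt{n} digrams, and the
% replacement that follows absorbs at least \lfloor \sqrt{n} \rfloor edges:
\[
  \frac{\text{cost of one scan}}{\text{edges absorbed afterwards}}
  \;\le\; \frac{c \, \sqrt{n}}{\lfloor \sqrt{n} \rfloor}
  \;=\; O(1).
\]
% At most n - 1 edges can be absorbed during the whole run, so the number of
% scans is at most (n - 1) / \lfloor \sqrt{n} \rfloor and all scans together cost
\[
  \frac{n - 1}{\lfloor \sqrt{n} \rfloor} \cdot O(\sqrt{n}) \;=\; O(n).
\]
```

Charging each scan to the edges absorbed by the subsequent replacement therefore yields constant cost per absorbed edge.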

In a run of TreeRePair we can replace at most n—l digram occurrences and, as shown before, 
the replacement of each occurrence, the update of the sets of non-overlapping occurrences and 
the determination of the most frequent pair can be accomplished in constant time per occurrence 
replacement. Thus, the whole replacement step can be completed in linear time. 

4.5 Impact of the DAG Representation 

In the preceding section, dealing with the complexity of our implementation of the Re-pair for 
Trees algorithm, we did not pay attention to the underlying DAG representation of the input tree. 
This enabled us to concentrate on the essentials. Nevertheless, we have to clarify the impact 
of this representation, particularly concerning the compression performance and the runtime of 
our implementation, since TreeRePair uses it by default. Only when started with the 
-no_dag switch does it forgo the DAG representation and load the whole input tree into main 
memory. 

Let $\mathcal{G} = (N, P, S)$ be a 0-bounded SLCF tree grammar. We assume without loss of generality
that for every $B \in N$ it holds that $B \to_{\mathcal{G}}^{*} S$. Let $(A \to t) \in P$, $t = (\mathrm{dom}_t, \lambda_t) \in T(\mathcal{F} \cup N)$ and
$v \in \mathrm{dom}_t$. We define the function unfold using the algorithm listed in Fig. 31. It holds that

" Intuitively, we define jocct„(maXm(in))| = Oif maxm(^n) = undefined. 
    FUNCTION unfold(G, t, v)
      let G = (N, P, S) and (A → t) ∈ P;
      if ref_G(A) ≠ ∅ then
        M := ∅;
        for each (t', v') ∈ ref_G(A) do
          M := M ∪ {uv | u ∈ unfold(G, t', v')};
        endfor
      else
        M := {v};
      endif
      return M;
    ENDFUNC

Fig. 31: The algorithm which computes $\mathrm{unfold}(\mathcal{G}, t, v)$, where we have $t \in T(\mathcal{F} \cup N)$ and $v \in \mathrm{dom}_t$.



     1  FUNCTION retrieve-all-occs-dag(t)
     2    v := ε;
     3    while (true) do
     4      v := next_in_postorder(t, v);
     5      if (v ≠ ε) then
     6        if (λ_t(v) ∉ N) then
     7          α := (λ_t(parent(v)), index(v), λ_t(v));
     8          if (v ∉ occ_t(α)) then
     9            occ_t(α) := occ_t(α) ∪ {parent(v)};
    10          endif
    11        else
    12          let t' be the right-hand side of λ_t(v);
    13          if (λ_t'(ε) ≠ λ_t(parent(v))) then
    14            α := (λ_t(parent(v)), index(v), λ_t'(ε));
    15            occ_t(α) := occ_t(α) ∪ {parent(v)};
    16          endif
    17        endif
    18      else
    19        return;
    20      endif
    21    endwhile
    22  ENDFUNC

Fig. 32: The function retrieve-all-occs listed in Fig. 26 adapted for the DAG case. For every $\alpha \in \Pi$ the set $\mathrm{occ}_t(\alpha)$ is
initially set to $\emptyset$.

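$\mathrm{unfold}(\mathcal{G}, t, v) \subseteq \mathrm{dom}_{\mathrm{val}(\mathcal{G})}$ and it also holds that

$$\bigcup_{(A \to t) \in P,\ v \in \mathrm{dom}_t} \mathrm{unfold}(\mathcal{G}, t, v) = \mathrm{dom}_{\mathrm{val}(\mathcal{G})} .$$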
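As an illustration, the function unfold from Fig. 31 can be sketched in Python. The representation is an assumption made for this sketch only: a grammar is a dict from nonterminal names to right-hand sides, a tree is a nested tuple `(label, child_1, ..., child_k)`, a right-hand side is identified by its left-hand side, and node positions are strings of child indices (so ranks are assumed to be below 10, with the empty string playing the role of $\varepsilon$).

```python
def nodes(tree, pos=""):
    """Yield (position, label) for every node of a nested-tuple tree."""
    label, *children = tree
    yield pos, label
    for i, child in enumerate(children, start=1):
        yield from nodes(child, pos + str(i))

def refs(grammar, a):
    """All pairs (B, v') such that node v' of the right-hand side of B is labeled a."""
    return [(b, v) for b, rhs in grammar.items()
            for v, label in nodes(rhs) if label == a]

def unfold(grammar, a, v):
    """Positions of val(G) represented by node v of the right-hand side of a."""
    references = refs(grammar, a)
    if not references:          # the start nonterminal is never referenced
        return {v}
    return {u + v for b, w in references for u in unfold(grammar, b, w)}

# DAG grammar with productions S -> f(A, A) and A -> g(a, b, c).
grammar = {
    "S": ("f", ("A",), ("A",)),
    "A": ("g", ("a",), ("b",), ("c",)),
}
all_positions = set()
for name, rhs in grammar.items():
    for v, _ in nodes(rhs):
        all_positions |= unfold(grammar, name, v)
```

For this grammar the union of all unfold sets is exactly the set of node positions of the tree f(g(a, b, c), g(a, b, c)), as the equation above states.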

Let us consider a run $\mathcal{G}_0, \mathcal{G}_1, \ldots, \mathcal{G}_h$ of TreeRePair, where $\mathcal{G}_i = (N_i, P_i, S_i)$, $(S_i \to t_i) \in P_i$,
$h \in \mathbb{N}$ and $i \in \{0, 1, \ldots, h\}$. Then, in our implementation, $t_i$ is represented by a 0-bounded
(linear) SLCF tree grammar $\mathcal{G}_i = (N_i, P_i, S_i)$, i.e., we have $\mathrm{val}(\mathcal{G}_i) = t_i$, by default.

Constructing the Sets of Non-overlapping Occurrences In the first iteration of TreeRePair we
need to construct the set $\mathrm{occ}_{t_0}(\alpha)$ for every digram $\alpha \in \Pi$ occurring in $t_0$. Our first try to
accomplish this could be a postorder traversal of all the right-hand sides of $P_0$'s productions using
the function retrieve-all-occs listed in Fig. 26 on page 38. However, when traversing the
right-hand sides of the DAG grammar $\mathcal{G}_0$ individually, we do not consider occurrences spanning
two productions of the DAG.
Fig. 33: The tree $t \in T(\mathcal{F})$ which can be represented by a DAG grammar with productions $(S \to f(A, A))$ and
$(A \to g(a, b, c))$.

Fig. 34: The tree $t' \in T(\mathcal{F})$ which can be represented by a DAG grammar with productions $A_1 \to f(A_2, A_2)$ and
$A_2 \to f(a, f(a, a))$.



Example 21. Consider the DAG grammar $\mathcal{G} = (N, P, S)$, where $N = \{S, A\}$ and $P$ contains
the two productions $(S \to f(A, A))$ and $(A \to g(a, b, c))$. It is a compressed representation of
the tree $t \in T(\mathcal{F})$ depicted in Fig. 33. If we used the function retrieve-all-occs
to determine all digram occurrences in the right-hand sides of $P$'s productions, we would not
capture the node $\varepsilon \in \mathrm{dom}_t$ which is an occurrence of both the digram $(f, 1, g)$ and the digram
$(f, 2, g)$.

As we have seen, it is necessary to modify the retrieve-all-occs function slightly to also
take occurrences spanning two productions into account. We use the algorithm listed in Fig. 32
to obtain the set $\mathrm{occ}_t(\alpha)$ for every right-hand side $t$ of $\mathcal{G}_0$'s productions and every digram $\alpha \in \Pi$
occurring in $t_0$. After that, we set

$$\mathrm{occ}_{t_0}(\alpha) := \bigcup_{(A \to t) \in P_0,\ v \in \mathrm{occ}_t(\alpha)} \mathrm{unfold}(\mathcal{G}_0, t, v) .$$

We test in line 13 of the retrieve-all-occs-dag function (Fig. 32) if $\alpha$ has equal parent and child
symbols. If this proves to be true, we do not add the corresponding occurrence to $\mathrm{occ}_t(\alpha)$, i.e., we do
not consider occurrences of a digram with equal parent and child symbols spanning two
productions of the DAG. If we did so, we would possibly register overlapping occurrences and
run into problems during a later replacement of $\alpha$. Consider the following example:

Example 22. Consider the DAG grammar $\mathcal{G} = (N, P, A_1)$ given by the productions $(A_i \to t_i) \in P$,
where $i \in \{1, 2\}$, $t_1 = f(A_2, A_2)$ and $t_2 = f(a, f(a, a))$. It is a compressed representation
of the tree $t' \in T(\mathcal{F})$ depicted in Fig. 34. We use the algorithm from Fig. 32 to obtain the
sets $\mathrm{occ}_{t_i}(\alpha)$ for $i \in \{1, 2\}$ and every digram $\alpha \in \Pi$ occurring in $t'$. Let us assume that we omit
the check in line 13, i.e., we also consider occurrences of digrams with equal parent and child
Fig. 35: Tree $t'' \in T(\mathcal{F})$ with seven overlapping occurrences of the digram $(f, 2, f)$.



symbols spanning two productions. The union

$$\bigcup_{i \in \{1,2\},\ v \in \mathrm{occ}_{t_i}((f,2,f))} \mathrm{unfold}(\mathcal{G}, t_i, v) = \{\varepsilon, 1, 2\}$$

contains the overlapping occurrences $\varepsilon$ and 2 of the digram $(f, 2, f)$.

The precaution from line 13 sometimes leads to situations in which we replace fewer occurrences
of a digram with equal parent and child symbols than we would replace when not using the DAG
representation.

Example 23. Consider the tree $t'' \in T(\mathcal{F})$ from Fig. 35 which can be represented by the DAG
grammar consisting of the two productions $(S \to t_1)$ and $(A \to t_2)$ with $t_1 = f(f(b, A), f(c, A))$
and $t_2 = f(a, f(a, f(a, a)))$. After careful counting one can tell that $t''$ exhibits at most four
non-overlapping occurrences of the digram $\alpha = (f, 2, f)$. However, if we use the above
function retrieve-all-occs-dag we only capture three of them. We obtain $\mathrm{occ}_{t_1}(\alpha) = \{\varepsilon\}$,
$\mathrm{occ}_{t_2}(\alpha) = \{2\}$ and therefore

$$\bigcup_{i \in \{1,2\},\ v \in \mathrm{occ}_{t_i}(\alpha)} \mathrm{unfold}(\mathcal{G}, t_i, v) = \{\varepsilon, 122, 222\} .$$
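The numbers of Example 23 can be reproduced with a small Python sketch of the procedure from Fig. 32 combined with unfold. The data layout is an assumption of this sketch, not the prototype's: trees are nested tuples `(label, child_1, ..., child_k)`, a grammar is a dict from nonterminal names to right-hand sides, and positions are strings of child indices with the empty string standing for $\varepsilon$.

```python
from collections import defaultdict

def postorder(tree, pos=""):
    """Yield (position, subtree) for every node in postorder."""
    _, *children = tree
    for i, child in enumerate(children, start=1):
        yield from postorder(child, pos + str(i))
    yield pos, tree

def label_at(tree, pos):
    for step in pos:
        tree = tree[int(step)]
    return tree[0]

def retrieve_all_occs_dag(grammar, name):
    """Occurrence sets for the right-hand side of `name`, as in Fig. 32."""
    t = grammar[name]
    occ = defaultdict(set)
    for v, sub in postorder(t):
        if v == "":
            continue                      # the root has no parent in t
        parent, index = v[:-1], int(v[-1])
        parent_label = label_at(t, parent)
        if sub[0] not in grammar:         # child labeled by a terminal
            alpha = (parent_label, index, sub[0])
            if v not in occ[alpha]:       # avoid overlapping occurrences
                occ[alpha].add(parent)
        else:                             # occurrence spans two productions
            root_label = grammar[sub[0]][0]
            if root_label != parent_label:    # the check from line 13
                occ[(parent_label, index, root_label)].add(parent)
    return occ

def unfold(grammar, name, v):
    references = [(b, w) for b, rhs in grammar.items()
                  for w, sub in postorder(rhs) if sub[0] == name]
    if not references:
        return {v}
    return {u + v for b, w in references for u in unfold(grammar, b, w)}

def occurrences(grammar, alpha):
    result = set()
    for name in grammar:
        for v in retrieve_all_occs_dag(grammar, name)[alpha]:
            result |= unfold(grammar, name, v)
    return result

grammar = {
    "S": ("f", ("f", ("b",), ("A",)), ("f", ("c",), ("A",))),
    "A": ("f", ("a",), ("f", ("a",), ("f", ("a",), ("a",)))),
}
captured = occurrences(grammar, ("f", 2, "f"))
```

Only three of the four possible non-overlapping occurrences are captured, because the two occurrences ending at the root of A are skipped by the check from line 13.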

Even though this approach does not capture all the occurrences which could be captured when 
not using the DAG representation, it still achieves a competitive compression performance on 
our set of test files (cf. Sect. 6.6 on page 62). It seems that a more involved method of dealing 
with digrams with equal parent and child symbols spanning two productions would necessitate a 
partial unfolding of the DAG. The latter, however, would certainly result in a longer runtime. 

Updating the Sets of Non-overlapping Occurrences Considering the graph representation of 
a DAG, a tree node can exhibit multiple parent nodes. In fact, a node has multiple parent nodes 
if it is the root of the right-hand side of a production of the corresponding DAG grammar and if 
this production is referenced multiple times. 
    FUNCTION remove-absorbed-occs-dag(t, v, i)
      if (v ≠ ε) then
        remove-occ-dag(t, parent(v), index(v));
      else
        let A be the left-hand side of the production with right-hand side t;
        for each (t', u) ∈ ref_G(A) do
          remove-occ-dag(t', parent(u), index(u));
        endfor
      endif

      for (l ∈ {1, 2, ..., rank(λ_t(v))}) do
        remove-occ-dag(t, v, l);
      endfor

      for (l ∈ {1, 2, ..., rank(λ_t(vi))}) do
        remove-occ-dag(t, vi, l);
      endfor
    ENDFUNC

    FUNCTION remove-occ-dag(t, v, j)
      if (λ_t(vj) ∉ N) then
        α := (λ_t(v), j, λ_t(vj));
      else
        let t' be the right-hand side of λ_t(vj);
        α := (λ_t(v), j, λ_t'(ε));
      endif
      occ_t(α) := occ_t(α) \ {v};
    ENDFUNC

Fig. 36: Listing of the function remove-absorbed-occs-dag which removes all absorbed occurrences from the $\mathrm{occ}_t$ sets
when using the DAG mode.



To capture all digram occurrences which are absorbed by the replacement of a digram we 
need to take care of the above fact. The remove-absorbed-occs function listed in Fig. 28 
needs to be adapted accordingly. Instead of removing one occurrence formed by the node being 
replaced and its parent, we need to iterate over possibly multiple parents and remove all corre- 
sponding occurrences. In Fig. 36 the function remove-absorbed-occs-dag is listed which 
incorporates this necessary modification. Analogously, the function add-new-occs listed in 
Fig. 29 must be modified to work properly in the DAG mode. Fig. 37 shows an adapted version. 

It is easy to see that our linear runtime is not negatively affected by this loop over all parents. 
Far from it — as mentioned earlier, the DAG representation saves us time by avoiding repetitive 
re-calculations. 



Replacing the Digrams The third and last scenario in which we have to take special care of
the DAG representation is when replacing an occurrence of a digram $\alpha \in \Pi$ spanning two
productions of the DAG grammar. Due to our restriction on digrams with equal parent and child
symbols the digram $\alpha$ has to have different parent and child symbols. In the following we
use an example to describe what needs to be done when replacing the digram $\alpha$.

Example 24. Consider the DAG grammar given by the productions $S \to f(g(t_1, A), A)$ and
$A \to h(t_2, t_3)$ which represents the $\mathcal{F}$-labeled tree $t$ depicted in Fig. 38. Imagine that we want




    FUNCTION add-new-occs-dag(t, v)
      if (v ≠ ε) then
        add-occ-dag(t, parent(v), index(v));
      else
        let A be the left-hand side of the production with right-hand side t;
        for each (t', u) ∈ ref_G(A) do
          add-occ-dag(t', parent(u), index(u));
        endfor
      endif

      for (l ∈ {1, 2, ..., rank(λ_t(v))}) do
        add-occ-dag(t, v, l);
      endfor
    ENDFUNC

    FUNCTION add-occ-dag(t, v, j)
      if (λ_t(vj) ∉ N) then
        α := (λ_t(v), j, λ_t(vj));
      else
        let t' be the right-hand side of λ_t(vj);
        α := (λ_t(v), j, λ_t'(ε));
      endif
      occ_t(α) := occ_t(α) ∪ {v};
    ENDFUNC

Fig. 37: Listing of the function add-new-occs-dag which adds all new occurrences to the $\mathrm{occ}_t$ sets when using the DAG
mode.

Fig. 38: Depiction of the $\mathcal{F}$-labeled tree $t$. We have $t_1, t_2, t_3 \in T(\mathcal{F})$.

to replace the sole occurrence of the digram $(f, 2, h)$, i.e., an occurrence spanning two
productions. In order to do that we mainly have to complete the following three steps.

(1) We first have to introduce a new production for every child of the node labeled by $h$. Thus,
we obtain two new productions $B \to t_2$ and $C \to t_3$. We can skip this step for every child
node which is already labeled by a nonterminal of the DAG grammar.

(2) We need to update the production with left-hand side $A$ to $A \to h(B, C)$.

(3) Finally, we introduce a new nonterminal $D$ representing the digram $(f, 2, h)$ and update the
production for $S$ to

$$S \to D(g(t_1, A), B, C) .$$

The above steps are only necessary if the production with left-hand side $A$ is referenced more
than once. Otherwise we could have directly connected the children of $h$ to the newly introduced
node labeled by $D$ and removed the production with left-hand side $A$ from the grammar.
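The three steps of Example 24 can be mimicked by the following Python sketch. It is only an illustration under assumptions: the function name `replace_spanning` and the nested-tuple grammar layout are hypothetical, the subtrees $t_1, t_2, t_3$ are stood in for by leaves, the occurrence is assumed to sit at the root of the parent production, and the child production is assumed to be referenced more than once.

```python
def replace_spanning(grammar, parent_nt, child_index, new_symbol, fresh_names):
    """Replace a digram occurrence whose parent is the root of
    grammar[parent_nt] and whose child is the root of the production
    referenced by the child_index-th child of that root."""
    _, *siblings = grammar[parent_nt]
    child_nt = siblings[child_index - 1][0]
    child_label, *subtrees = grammar[child_nt]

    shared = []
    for sub in subtrees:                          # step (1)
        if sub[0] in grammar:                     # already a nonterminal node
            shared.append(sub)
        else:
            name = next(fresh_names)
            grammar[name] = sub
            shared.append((name,))
    grammar[child_nt] = (child_label, *shared)    # step (2)

    # Step (3): the new node takes over the remaining children of the
    # parent and, in place of the replaced child, the shared subtrees.
    merged = siblings[:child_index - 1] + shared + siblings[child_index:]
    grammar[parent_nt] = (new_symbol, *merged)

grammar = {
    "S": ("f", ("g", ("t1",), ("A",)), ("A",)),
    "A": ("h", ("t2",), ("t3",)),
}
replace_spanning(grammar, "S", 2, "D", iter(["B", "C"]))
```

After the call the grammar contains B → t2, C → t3, A → h(B, C) and S → D(g(t1, A), B, C), as in the example. The production defining the pattern of D itself is not modeled here.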



For the sake of convenience, the example above uses a rather small tree and we decide to replace a digram occurring only once. We
could easily enlarge $t$ such that $(f, 2, h)$ occurs multiple times and still show the same.
Fig. 39: Hierarchy of the employed encodings (from top to bottom: fixed-length coding, super Huffman coding,
run-length coding, three base Huffman codings, linear SLCF tree grammar).

Since at most k new productions need to be introduced, the replacement of a digram occurrence 
can still be accomplished in constant time. All in all, it has become clear that even when repre- 
senting the input tree of our algorithm as a DAG our implementation runs in linear time. 

4.6 Technical Details on the Prototype 

The source code of the TreeRePair prototype and its documentation are available at the Google
Code open source developer site. It can be accessed by visiting the following web page:

http://code.google.com/p/treerepair

However, the implementation should be considered to be of alpha quality. There is still a lot of 
testing to be done. 

We also implemented a decompressor called TreeDePair which is contained in the TreeRePair 
distribution. It is not optimized in terms of time and memory usage. 

The software is licensed under the GPLv3 license which is available at

http://www.gnu.org/licenses/gpl-3.0.txt

It is implemented using the C++ programming language and can be compiled at least under the
Windows and Linux operating systems. For compile instructions and library requirements, see
the README.txt file in the root directory of the TreeRePair distribution.

5 Succinct Coding 

In order to achieve a compact representation of the input tree of our TreeRePair algorithm we 
further compress the generated linear SLCF tree grammar by a binary succinct coding. The tech- 
nique we use is loosely based on the DEFLATE algorithm described in [Deu96]. In fact, we use 
a combination of a fixed-length coding, multiple Huffman codings and a run-length coding to 
encode different aspects of the grammar (cf. Fig. 39). 

In spite of the fact that we obtain an extremely compact binary representation of the generated 
SLCF tree grammar we are still able to directly execute queries on it with little effort. Basically, 
we only have to reconstruct the Huffman trees to be able to partially decompress the grammar on 
demand. 

In [MMS08] many different variants of succinct codings specialized in SLCF tree grammars
were investigated. Among them there was one encoding scheme which turned out to achieve
the best compression performance in general, at least with respect to the set of sample SLCF
tree grammars which was used in that work. However, our experiments show that, regarding the
SLCF tree grammars generated by TreeRePair, this encoding is outperformed by the succinct
coding which we present in this section.

5.1 General Remarks 

In this section, we want to elaborate on the following topics: How do we need to modify the 
pruning step of our algorithm to make our succinct coding as efficient as possible? How does 
TreeRePair efficiently deal with parameter nodes? How can we serialize a Huffman tree in a 
compact way? 

Inefficient Productions Our experiments showed that, at least for our set of test XML
documents, we achieve better compression results in terms of the size of the output file if we slightly
modify the pruning step of our algorithm. It turns out that our succinct coding, which we describe
in the following sections, is most efficient if we prune all productions with a sav-value smaller
than or equal to 2 (instead of pruning all productions with a sav-value smaller than or equal to 0
as it is described in Sect. 3.3 on page 13). However, we use this modification only if we make
the size of the output file a top priority (by using the switch -optimize filesize).
Otherwise, when optimizing the number of edges of the final grammar (i.e., when using the switch
-optimize edges), we stick to the original version of the pruning step.

Handling of Parameter Nodes Let $\mathcal{G} = (N, P, S)$ be the linear SLCF tree grammar which was
generated by a run of TreeRePair. Then, for every production $(A \to t) \in P$ it holds that $y_i \in \mathcal{Y}$
labels the $i$-th parameter node of $t$ in preorder, where $i \in \{1, 2, \ldots, \mathrm{rank}(A)\}$. Due to this fact
it is sufficient to represent the parameter symbols $y_1, y_2, \ldots, y_{\mathrm{rank}(A)} \in \mathcal{Y}$ by a single parameter
symbol $y \in \mathcal{Y}$. Let $(B \to t') \in P$ be another production and let $v \in \mathrm{dom}_{t'}$ with $\lambda_{t'}(v) = A$. Now,
let us assume that we want to eliminate the production $(A \to t)$ and that we use only a single
parameter symbol labeling all parameter nodes. It is clear that the $i$-th (in preorder) parameter
node of $t$ must be replaced by the subtree which is rooted at the $i$-th child of $v$.

Our implementation takes advantage of the above simplification, i.e., it uses only one
parameter symbol $y$ for every occurring parameter node.
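A brief Python sketch may make the role of the preorder convention concrete. The names `inline` and `eliminate` and the nested-tuple tree layout `(label, child_1, ..., child_k)` are assumptions of this sketch; the single parameter symbol is written as the label `"y"`.

```python
def inline(rhs, args):
    """Copy rhs, replacing its i-th parameter node (in preorder) by args[i-1]."""
    remaining = iter(args)

    def walk(tree):
        label, *children = tree
        if label == "y":
            return next(remaining)
        # Children are visited left to right, so parameters are consumed
        # in preorder.
        return (label, *[walk(child) for child in children])

    return walk(rhs)

def eliminate(tree, name, rhs):
    """Replace every node labeled `name` in tree by the inlined right-hand side."""
    label, *children = tree
    children = [eliminate(child, name, rhs) for child in children]
    if label == name:
        return inline(rhs, children)
    return (label, *children)

# A(y, y) -> f(y, g(y)) is eliminated from B -> h(A(a, b)).
rhs_a = ("f", ("y",), ("g", ("y",)))
rhs_b = ("h", ("A", ("a",), ("b",)))
result = eliminate(rhs_b, "A", rhs_a)
```

Because the position of a parameter in preorder already identifies it, no indices need to be stored with the parameter symbol.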

Serializing Huffman trees As stated in [Deu96], it is sufficient to only write out the lengths 
of the generated codes to be able to reconstruct a Huffman tree at a later date. However, this 
requires the decompressor to be aware of the following. 

- What symbols are encoded by the corresponding Huffman tree? 

- In what order are their code lengths listed? 

In our case only integers need to be encoded by Huffman codings because we will encode all
symbols by integers (see Sect. 5.2 on page 49). Hence, it is natural to use the natural order of
integers to list the lengths of the generated codes. Let us assume that $n \in \mathbb{N}$ is the biggest integer
    Symbol  Code
    a       00
    b       1
    c       011
    e       010

Table 5.1: Huffman coding before the reorganization of the codes. The letters are listed in their natural order,
i.e., in alphabetic order.

    Symbol  Code
    a       10
    b       0
    c       110
    e       111

Table 5.2: Huffman coding from Table 5.1 after the reorganization of the codes.



which needs to be encoded, i.e., which was assigned a code. We just need to loop
over all integers $m \le n$ in their natural order and print out the corresponding code length for
each of them. For every $k \le n$ to which no code was assigned we print out a code length of 0.

In order to solely rely on the code lengths there is still something which needs to be consid- 
ered. We are required to assign new codes to the integers based on the lengths of their original 
codes. More precisely, the new code assignment has to fulfill the following two requirements. 

(1) All codes of the same code length exhibit lexicographically consecutive values when order- 
ing them in the natural order of the integers they represent. 

(2) Shorter codes lexicographically precede longer codes. 

This reorganization of the Huffman codes does not affect the compression performance of the 
coding since only codes of the same length are swapped. The following example is based on an 
example from [Deu96]. 

Example 25. Imagine that we want to use a Huffman coding to encode the letters a, b, c and
e which each occur multiple times in a data stream. Let us assume that we obtain the
Huffman codes listed in Table 5.1. In order to be able to store the corresponding Huffman tree
by only writing out the lengths of the Huffman codes we need to assign new codes to the letters.
Table 5.2 shows the newly assigned codes which fulfill the above two requirements (1) and (2).
Now, let us assume that the decompressor expects the code lengths to be the lengths of codes
assigned to the letters of the Latin alphabet and that these code lengths are ordered in the natural
order of the letters they represent. Then, the corresponding Huffman tree can be unambiguously
represented by the following sequence of code lengths: 2, 1, 3, 0, 3. Note that we need to insert a
code length of 0 at the position of the letter d since there is no code assigned to the letter d.
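The reorganization can be sketched in Python following the procedure given in [Deu96]. The function name `canonical_codes` is hypothetical; symbols are identified by their index in the natural order, and a length of 0 marks a symbol without a code.

```python
def canonical_codes(lengths):
    """Rebuild the codes from the code lengths alone.
    lengths[i] is the code length of the i-th symbol in natural order."""
    max_len = max(lengths)
    count = [0] * (max_len + 1)          # number of codes per length
    for n in lengths:
        if n:
            count[n] += 1

    # Smallest code value for every length: shorter codes come first.
    next_code = [0] * (max_len + 1)
    code = 0
    for bits in range(1, max_len + 1):
        code = (code + count[bits - 1]) << 1
        next_code[bits] = code

    # Codes of equal length get consecutive values in symbol order.
    codes = {}
    for symbol, n in enumerate(lengths):
        if n:
            codes[symbol] = format(next_code[n], "0{}b".format(n))
            next_code[n] += 1
    return codes

# Code lengths of the letters a, b, c, d, e from Example 25.
codes = canonical_codes([2, 1, 3, 0, 3])
by_letter = {"abcde"[symbol]: code for symbol, code in codes.items()}
```

For the sequence 2, 1, 3, 0, 3 this yields exactly the codes of Table 5.2, which is why the decompressor needs nothing but the code lengths.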

5.2 Contents of the Output File 

In this section we want to elaborate on the information which needs to be stored in the output file 
of our algorithm in order to be able to reconstruct the generated linear SLCF tree grammar at a 
later date. We also want to demonstrate how this data can be efficiently represented. However, at 
this time we do not pay attention to the fixed-length, run-length or Huffman codings which are 
employed in a subsequent step of the encoding process. For the sake of simplicity we consider 
these encodings in separate subsections of this section. 
    Bit string  Description
    00          rank 0
    01          rank 1, right child
    10          rank 1, left child
    11          rank 2

Table 5: The bit strings encoding the children characteristics together with their meaning.



Let $\mathcal{G} = (N, P, S)$ be the linear SLCF tree grammar which was generated by a run of
TreeRePair. Before we are able to compile the information which needs to be written out we
need to assign to every symbol from $\mathcal{F} \cup (N \setminus \{S\}) \cup \{y\}$ a unique integer. In fact, we assign to
every symbol from $\mathcal{F}$ a unique ID from the set $\{1, 2, \ldots, |\mathcal{F}|\} \subset \mathbb{N}$. We assign the ID $|\mathcal{F}| + 1$ to
$y$, i.e., to the special symbol labeling all parameter nodes in the right-hand sides of $P$'s
productions. Finally, we associate with every symbol from the set of nonterminals $N \setminus \{S\}$ a unique ID
from the set $\{|\mathcal{F}| + 2, |\mathcal{F}| + 3, \ldots, |\mathcal{F}| + |N|\}$. The IDs are assigned to the nonterminals in such
a way that the nonterminal $A \in N \setminus \{S\}$ has a higher ID than the nonterminal $B \in N \setminus \{S\}$ if
$B \to_{\mathcal{G}} A$ holds.

Writing out the Necessary Information Now, we are able to write out the information needed
to reconstruct $\mathcal{G}$ in four steps. Bear in mind that the values mentioned below are not directly
written to the output file but that they are additionally encoded by a combination of multiple
Huffman codings, a run-length coding and a fixed-length coding later on.

First step In the first step, we write out the number of terminal symbols $|\mathcal{F}|$ and the number of
introduced productions $|N| - 1$, i.e., we are not counting the start production. By handing over
this information to the decompressor we avoid the insertion of separators marking, for instance,
the end of the enumeration of element types (which are written out in the third step).

Second step In the second step, we directly append a representation of the children characteristics
of the terminal symbols. By children characteristics we mean their rank and, concerning terminal
symbols of rank 1, whether we are dealing with a left or a right child. Due to the fact that all terminal
symbols have a rank of at most two, we can encode this information using two bits per symbol.
Table 5 lists all the bit strings we use together with a brief description of their meanings. We
write out the children characteristics as follows: Firstly, we print out a bit string from Table 5
representing a certain children characteristic. After that we append the number of corresponding
terminal symbols and finally we enumerate their IDs. We do this for the characteristics 00, 01
and 10. We omit the enumeration of all terminal symbols with a rank of 2 since their IDs can be
reconstructed with the information in hand. In fact, we just need to subtract the set of IDs of all
terminal symbols with children characteristics 00, 01 and 10 from the set of IDs of all terminal
symbols from $\mathcal{F}$ (which is $\{1, 2, \ldots, |\mathcal{F}|\}$).



Consult Sect. 2.4 on page 7 for an explanation of why this information is necessary to reconstruct the input
tree.



Furthermore, it is not necessary to print out the ranks of the nonterminals from N since these 
can be easily reconstructed by counting the number of parameter nodes in the corresponding 
right-hand sides. The latter are written to the output file in the fourth step. 



Third step In this step, we print the element types of the terminal symbols in the ascending order 
of their IDs to the output file. We do this by writing out the ASCII code of every single letter. 
The individual names are terminated by the ASCII character ETX which is assumed not to be 
used within the element types of the terminal symbols. 



Fourth step In this last step we serialize the productions of $\mathcal{G}$ in the ascending order of the IDs
of their left-hand sides. For every production $(A \to t) \in P$ we just write out the IDs of the labels
of $t$'s nodes in preorder. We do not need to use special marker symbols to indicate the nesting
structure of the symbols and their IDs, respectively. When parsing the output file this hierarchy
can be easily obtained by taking care of the individual ranks of the symbols.

We can also omit the specification of the left-hand side $A$ since both its ID and its rank
can be reconstructed with the information in hand. Imagine that we are parsing the output file to
reconstruct the productions of $\mathcal{G}$. If we are parsing the $i$-th production, the ID of its left-hand side
must be $|\mathcal{F}| + 1 + i$, where $i \in \{1, 2, \ldots, |N|\}$. As already mentioned, the rank of the left-hand
side can be obtained by counting the parameter nodes in the right-hand side once this has been
reconstructed.

Note that it is superfluous to insert separators between the representations of the
productions from $P$ since their boundaries can be calculated based on the ranks of the symbols. Again,
imagine that we are trying to reconstruct the productions of $P$ by parsing the output file of our
algorithm. Let $(A \to t) \in P$ be the first production we encounter. The tree $t$ can only consist
of nodes labeled by terminal symbols, i.e., we must have $t \in T(\mathcal{F})$. The ranks of all symbols
from $\mathcal{F}$ are known since the necessary information was written to the compressed file in the
second step. Therefore, we can easily reconstruct $t$ by iteratively parsing the corresponding IDs
in the output file. While doing so we are also able to count the number of occurrences of the
symbol $y \in \mathcal{Y}$ in $t$. Thus, we are aware of the value of $\mathrm{rank}(A)$. After that, we proceed with
decoding the second production $(A' \to t') \in P$ by iteratively parsing the next IDs. We have
$t' \in T(\mathcal{F} \cup \{A\})$, i.e., the ranks of all occurring symbols are known. That way all productions
from $P$ can be reconstructed.

Example 26. In order to get a clear picture of the representation described above we apply the
previous four steps to the linear SLCF tree grammar $\mathcal{G} = (N, P, S_4)$ over the ranked alphabet
$\mathcal{F}$ from Sect. 3.4 on page 18, i.e., we have $N = \{S_4, A_2, A_3\}$ and $P$ is the following set of



The first production contains only terminal symbols because we have written out the productions in the ascending order of the
IDs of their left-hand sides. These IDs were assigned to the nonterminals in such a way that the nonterminal $A \in N \setminus \{S\}$ has
a higher ID than $B \in N \setminus \{S\}$ if $B \to_{\mathcal{G}} A$ holds. Therefore, the right-hand side of $(A \to t)$, which is the first production
which was written out, does not contain any node labeled by a nonterminal from $N$.
    Symbol                 ID
    $\mathrm{books}^{10}$     1
    $\mathrm{isbn}^{00}$      2
    $\mathrm{title}^{01}$     3
    $\mathrm{author}^{01}$    4
    $\mathrm{book}^{10}$      5
    $\mathrm{book}^{11}$      6
    $y$                    7
    $A_2$                  8
    $A_3$                  9

Fig. 40: All symbols with the ID assigned to them. The symbol $y$ is the symbol used to label the parameter nodes in the right-hand
sides of $P$'s productions.



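productions:

$$S_4 \to \mathrm{books}^{10}(A_3(A_3(A_3(A_3(\mathrm{book}^{10}(A_2))))))$$
$$A_3(y) \to \mathrm{book}^{11}(A_2, y)$$
$$A_2 \to \mathrm{author}^{01}(\mathrm{title}^{01}(\mathrm{isbn}^{00}))$$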

First of all, we assign to every symbol from $\mathcal{F} \cup (N \setminus \{S_4\}) \cup \{y\}$ a unique ID as it is shown in
Fig. 40. After that we are able to write out the grammar exactly as described above resulting in
the value sequence depicted in Fig. 41. We accomplish this task in four steps:

(1) We begin by writing out the number of terminals (6) directly followed by the number of
nonterminals minus the start nonterminal (2); see the values 0 and 1 in the depiction.

(2) After that the children characteristics of all terminal symbols are written to the file. We begin
by specifying all terminal symbols of rank 0 (values 2-4). This is done by firstly writing
out the bit string 00 and the number of corresponding symbols (1). Finally, the ID 2 of the
terminal symbol $\mathrm{isbn}^{00}$, which is the sole terminal symbol of rank 0, is listed.
Analogously, the terminal symbols with children characteristics 01 and 10 are enumerated
(values 5-12).

(3) Now, the element types of all terminal symbols are exported to the output file (values 13-46).
For each of them the decimal value of each ASCII character is written out. The element type
books, for instance, is encoded by the sequence 98, 111, 111, 107, 115.

(4) Finally, the productions from $P$ are written out in the ascending order of the IDs of their
left-hand sides. Thus, the production with left-hand side $A_2$ is serialized as the very first
production (values 47-49). It is encoded by the unambiguous sequence of IDs 4, 3, 2
representing the terminal symbols $\mathrm{author}^{01}$, $\mathrm{title}^{01}$ and $\mathrm{isbn}^{00}$ of the right-hand side of $A_2$ in
preorder. Afterwards the remaining productions with left-hand sides $A_3$ (values 50-52) and
$S_4$ (values 53-59) are printed to the output file in this order.
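The four steps can be retraced with a short Python sketch that builds the value sequence of this example and then splits the production part again using only the ranks. The function names, the list layout and the use of strings for the two-bit children characteristics are assumptions of the sketch; the subsequent Huffman, run-length and fixed-length codings are not modeled.

```python
ETX = 3  # ASCII end-of-text character terminating every element type

# Terminals in ID order 1..6 as (element type, children characteristic).
terminals = [("books", "10"), ("isbn", "00"), ("title", "01"),
             ("author", "01"), ("book", "10"), ("book", "11")]
Y, A2, A3 = 7, 8, 9
# Right-hand sides in preorder, ordered by the ID of their left-hand side.
productions = [[4, 3, 2],                    # A2
               [6, A2, Y],                   # A3
               [1, A3, A3, A3, A3, 5, A2]]   # S4

def serialize(terminals, productions):
    out = [len(terminals), len(productions) - 1]          # first step
    for bits in ("00", "01", "10"):                       # second step
        ids = [i for i, (_, c) in enumerate(terminals, start=1) if c == bits]
        out += [bits, len(ids)] + ids
    for name, _ in terminals:                             # third step
        out += [ord(ch) for ch in name] + [ETX]
    for rhs in productions:                               # fourth step
        out += rhs
    return out

def parse_productions(ids, terminal_rank, y):
    """Split a stream of preorder IDs into right-hand sides without separators."""
    rank = dict(terminal_rank)
    rank[y] = 0
    next_nonterminal = y + 1
    result, pos = [], 0
    while pos < len(ids):
        start, pending, params = pos, 1, 0
        while pending > 0:               # nodes still expected in this tree
            symbol = ids[pos]
            pos += 1
            if symbol == y:
                params += 1
            pending += rank[symbol] - 1
        result.append(ids[start:pos])
        rank[next_nonterminal] = params  # rank = number of parameter nodes
        next_nonterminal += 1
    return result

values = serialize(terminals, productions)
RANK_OF = {"00": 0, "01": 1, "10": 1, "11": 2}
terminal_rank = {i: RANK_OF[c] for i, (_, c) in enumerate(terminals, start=1)}
recovered = parse_productions(values[47:], terminal_rank, Y)
```

The sequence has 60 values, indexed 0 to 59, and the value ranges agree with those quoted in the four steps above.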



Fig. 41: Representation of the grammar $\mathcal{G}$ from Example 26.

Possible Optimizations Of course, there is still room to further reduce the data which needs to
be written to the output file. Consider, for instance, terminal symbols of the same element type
but different children characteristics. In the case of our implementation, the element type of these
symbols is written to the file two or three times in the third step. However, an optimization with
respect to this redundancy leads only to marginally better compression results. This is due to
the fact that typically the major part of the output file is the enumeration of the productions.

Regarding the second step, we could at first determine the most frequent children
characteristic and omit the enumeration of all corresponding terminal symbols. This dynamic approach
would lead to a small reduction of the size of the output file compared to always skipping the
children characteristic 11.

Another aspect which offers optimization potential is the possibly long lists of the parameter
symbol $y$ which emerge when writing out the right-hand sides of productions with a higher rank.
In this case, run-length coding can lead to a better compression performance. However, we did
not further investigate this matter since we focus on generating grammars with nonterminals with
a maximal rank of 4.

5.3 Employing Multiple Types of Encodings 

Even though a Huffman tree has to be serialized for every Huffman coding used within our 
output file, we decided in favor of using four distinct Huffman codings. We use three of them for 
encoding 

- the start production, 

- the remaining productions, the children characteristics of the terminal symbols and the num- 
bers of terminals and nonterminals, and finally 

- the names of the terminals. 

In the sequel, we call these three Huffman codings the base Huffman codings. The fourth
Huffman coding, which we call super Huffman coding, is used to encode the Huffman trees of the
above codings. Our tests with different numbers of Huffman codings revealed that, in general,
the above approach leads to the best compression results. This is at least true for most of the
XML test documents we used.

Base Huffman Codings We serialize the three base Huffman codings by writing out the lengths 
of the generated codes as it is described in Sect. 5.1 on page 48. However, we additionally apply 
a run-length coding and the super Huffman coding to achieve a compact binary representation. 
In Sect. 5.3 on page 54 we elaborate on how exactly the run-length coding works. We briefly call 
the length of a code of a base Huffman coding a base code length in the sequel. Analogously, we 
denote the lengths of the codes of the super Huffman coding by the term super code lengths. 

We output the number of base code lengths in front of every serialized base Huffman coding,
i.e., in front of every enumeration of base code lengths. That way the decompressor knows how
many bits are part of this binary representation. Let us point out that this number of code lengths
is encoded using $k$ bits instead of using the super Huffman coding, where $k \in \mathbb{N}$ is a constant
which is fixed at compile time. We do this due to the following fact. Let $n \in \mathbb{N}$ be the number
of code lengths and let us assume that we encode $n$, which is usually many times larger than the
maximum over all code lengths, using the super Huffman coding. This would result in a big gap
of unused integers between the super code lengths and $n$. This again would lead to a long list of
0s when storing the super Huffman tree by enumerating its code lengths. In general, this leads
to a reduced compression performance compared to a fixed-length coding of $n$ using $k$ bits.

Super Huffman Coding The super Huffman coding is also stored as the sequence of its
code lengths. However, this relatively small set of integers is encoded by a fixed-length coding
using $n \in \mathbb{N}$ bits, where $n$ is the smallest number of bits which suffices to encode
all super code lengths. More precisely, we serialize the super Huffman coding in three steps:

(1) First of all, we print out the binary representation of the number $n$ using $k$ bits, where
$k \in \mathbb{N}$ is a fixed number of bits which is specified at compile time.

(2) Let $m \in \mathbb{N}$ be the biggest base code length. We print out the binary representation
of $m$ using $k$ bits. With this information the decompressor knows that the next $n \cdot m$
bits make up the list of super code lengths.

(3) Finally, the binary representations of the $m$ many super code lengths are written to the
output file using $n$ bits for each code length. The super code lengths are printed in the
natural order of the integers which are represented by the corresponding codes.
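
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the TreeRePair implementation: the function names, the default k = 8 and the example list are hypothetical, and m is taken to be the number of super code lengths that follow, which is how step (3) uses it.

```python
def to_bits(value: int, width: int) -> str:
    """Return `value` as a 0-padded binary string of exactly `width` bits."""
    if value < 0 or value >= 1 << width:
        raise ValueError(f"{value} does not fit into {width} bits")
    return format(value, f"0{width}b")


def serialize_super_code_lengths(super_code_lengths: list[int], k: int = 8) -> str:
    """Serialize a list of super code lengths following steps (1)-(3).

    k is the fixed header width; its value here is a hypothetical choice.
    """
    # n: smallest number of bits that can hold every super code length (at least 1).
    n = max(1, max(super_code_lengths).bit_length())
    # m: number of entries that follow (assumption, see the lead-in).
    m = len(super_code_lengths)
    header = to_bits(n, k) + to_bits(m, k)                     # steps (1) and (2)
    body = "".join(to_bits(c, n) for c in super_code_lengths)  # step (3)
    return header + body


example = [1, 6, 4, 3, 3, 3, 5, 0, 0, 6]  # illustrative values only
bits = serialize_super_code_lengths(example)
print(bits[:8], bits[8:16], bits[16:])
```

With these ten illustrative lengths the largest value is 6, so n = 3 and the list occupies n · m = 30 bits after the two k-bit header fields.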

Run-length Coding of the Base Code Lengths In this section we explain the run-length coding
which is applied to the enumerations of code lengths used to write all base Huffman codings to
the output file. This additional encoding marks a major contribution to the compactness of our
representation. The bigger a code length is, the more different codes of that length are possible. At
the same time a sequence of several occurrences of the same code length within the enumeration
of all code lengths becomes more likely. In addition, our experience shows that it frequently
happens that there is a longer run of 0's in the list of all code lengths due to symbols which no
codes were assigned to.

Example 27. Consider, for instance, the example from Sect. 5.3 and in particular the base
Huffman coding C3 which is listed in Table 6.3 on page 56. This Huffman coding does not assign
codes to the symbols 4-96. This results in a sequence of 94 zeros within the enumeration of the
code lengths of C3.

Definition 1. Let $m, k \in \mathbb{N}$, where $k \geq \lfloor \log_2(m) \rfloor + 1$. In the following we denote by
$\mathrm{bin}_k(m)$ the (0-padded) binary representation $b_{k-1} b_{k-2} \ldots b_0$ of $m$, i.e., the following holds:

$$m = \sum_{i=0}^{k-1} b_i \cdot 2^i$$

We encode an enumeration of code lengths using a run-length coding as follows: Let us assume
that $n \in \mathbb{N}$ is the maximum code length. Then we use the three additional integers
$n+1$, $n+2$ and $n+3$ to indicate certain types of runs; we call them run indicators in the
sequel. In principle, all runs with a length less than or equal to 3 are written directly to the
output file. In contrast, a run of a code length $m \in \mathbb{N}$ exceeding this bound is
encoded as follows:

- If we have $m > 0$, we use the run indicator $n+1$ and a bit string with a length of 2 to indicate
4-7 repetitions of the code length $m$. If $k > 3$ is the length of the run of $m$ and $l = k \bmod 7$
(i.e., $l \in \{0, 1, \ldots, 6\}$), then this run is encoded as follows:

• if $l > 3$:

$$\underbrace{m \; (n+1) \; \mathrm{bin}_2(3)}_{\lfloor k/7 \rfloor \text{ times}} \;\; m \; (n+1) \; \mathrm{bin}_2(l-4)$$

• if $l \leq 3$:

$$\underbrace{m \; (n+1) \; \mathrm{bin}_2(3)}_{\lfloor k/7 \rfloor \text{ times}} \;\; [m]^l$$

Note that $[m]^l$ denotes $l$ many consecutive $m$'s.

- If we have $m = 0$, we use the run indicator $n+2$ with an appended bit string of length 3 to
denote 4-11 repetitions of $m$. In contrast, we use the run indicator $n+3$ together with a bit
string of length 7 to encode 12-139 repeated 0's.

If $k > 3$ is the length of the run of 0's and $l = k \bmod 139$ (i.e., $l \in \{0, 1, \ldots, 138\}$), then
this run is encoded as follows:

• if $l > 11$:

$$\underbrace{(n+3) \; \mathrm{bin}_7(127)}_{\lfloor k/139 \rfloor \text{ times}} \;\; (n+3) \; \mathrm{bin}_7(l-12)$$

• if $3 < l \leq 11$:

$$\underbrace{(n+3) \; \mathrm{bin}_7(127)}_{\lfloor k/139 \rfloor \text{ times}} \;\; (n+2) \; \mathrm{bin}_3(l-4)$$

• if $l \leq 3$:

$$\underbrace{(n+3) \; \mathrm{bin}_7(127)}_{\lfloor k/139 \rfloor \text{ times}} \;\; [m]^l$$

Symbol   Old code   New code
0        10         10
1        1110       1110
5        1111       1111
8        110        110
9        0          0

Table 6.1: Huffman coding C1 used to encode the start production.



Symbol   Old code   New code
1        1010       1100
2        01         00
3        00         01
4        111        100
5        1011       1101
6        110        101
7        1001       1110
8        1000       1111

Table 6.2: Huffman coding C2 used to encode the productions from P \ {S}, the children
characteristics, and numbers of terminals and nonterminals.



Symbol   Old code   New code
3        111        010
97       00101      11010
98       101        011
101      00100      11011
104      00111      11100
105      1001       1010
107      1101       1011
108      110011     111110
110      110010     111111
111      01         00
114      11000      11101
115      1000       1100
116      000        100
117      00110      11110

Table 6.3: Huffman coding C3 used to encode the names of the terminal symbols.



Symbol   Old code   New code
0        0          0
1        110000     111110
2        1101       1110
3        101        100
4        111        101
5        100        110
6        11001      11110
9        110001     111111

Table 6.4: Super Huffman coding used to encode the code lengths of the base Huffman codings.

Example 28. Consider the following sequence of integers:

1 2 2 3 3 3   4 4 4 4 4 4   5 5 5   0 0 0 0 0 0 0 0 0

Now, let us assume that we want to encode the above sequence using our run-length coding.
Obviously, we have $n = 5$. The above run of 4's with a length of 6 is represented by the sequence
4 6 10 since we have $n + 1 = 6$ and $\mathrm{bin}_2(6 - 4) = 10$. In contrast, the run of 0's with a
length of 9 leads to the sequence 7 101 because it holds that $n + 2 = 7$ and that
$\mathrm{bin}_3(9 - 4) = 101$. All in all, we obtain the sequence 1 2 2 3 3 3   4 6 10   5 5 5   7 101.

Surprisingly, our investigations evinced that an approach which dynamically adjusts the length 
of the bit strings used in the above encoding depending on the size of the input grammar does 
not lead to significantly better compression results. 
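
The run-length coding described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code of TreeRePair: the function names are hypothetical, the output is a list of tokens (integers for code lengths and run indicators, strings for the appended bit strings) instead of a bit stream, and, as in Example 28, every indicator n + 1 is preceded by the code length m it refers to.

```python
from itertools import groupby


def bin_k(value: int, k: int) -> str:
    """0-padded binary representation of `value` using k bits (Definition 1)."""
    return format(value, f"0{k}b")


def encode_run(m: int, k: int, n: int) -> list:
    """Encode a run of k copies of the code length m; n is the maximum code length."""
    if k <= 3:
        return [m] * k                      # short runs are written out directly
    out = []
    if m > 0:
        for _ in range(k // 7):             # full chunks of 7 repetitions
            out += [m, n + 1, bin_k(3, 2)]
        rest = k % 7
        if rest > 3:
            out += [m, n + 1, bin_k(rest - 4, 2)]
        else:
            out += [m] * rest
    else:
        for _ in range(k // 139):           # full chunks of 139 zeros
            out += [n + 3, bin_k(127, 7)]
        rest = k % 139
        if rest > 11:
            out += [n + 3, bin_k(rest - 12, 7)]
        elif rest > 3:
            out += [n + 2, bin_k(rest - 4, 3)]
        else:
            out += [0] * rest
    return out


def run_length_encode(code_lengths: list[int]) -> list:
    """Apply encode_run to every maximal run in the enumeration of code lengths."""
    n = max(code_lengths)
    out = []
    for m, group in groupby(code_lengths):
        out += encode_run(m, len(list(group)), n)
    return out


sequence = [1, 2, 2, 3, 3, 3] + [4] * 6 + [5] * 3 + [0] * 9
print(run_length_encode(sequence))
```

For the sequence of Example 28 this yields the tokens 1 2 2 3 3 3, 4 6 10, 5 5 5 and 7 101, and a run of 94 zeros with n = 6 yields the indicator 9 followed by 1010010.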

Example This example continues the encoding of the linear SLCF tree grammar $\mathcal{G}$ from
Example 26 on page 51. Tables 6.1, 6.2 and 6.3 list the three base Huffman codings, called C1,
C2 and C3 in the sequel, which are calculated by our implementation. The columns labeled Old
code show the initial Huffman codes while the columns labeled New code list the newly assigned
codes after the necessary reorganization described in Sect. 5.1 on page 48.

While Fig. 41 on page 53 shows the second part of the output file as generated by a run
of TreeRePair, Fig. 42 shows its first part. The latter stores the base Huffman codings
C1, C2 and C3 together with the corresponding super Huffman coding. For the sake of clarity the
corresponding values are denoted by their integer representation instead of by their fixed-length
or Huffman code. The Huffman coding C1 from Table 6.1, for instance, is given by the sequence
of code lengths ranging from value 13 to value 22, where value 12 informs us about the length
of this sequence. Analogously, the code lengths of the Huffman codings C2 and C3 are given by
the values 24-32 and 34-60, respectively. The sequence of code lengths of the Huffman coding
C3 exhibits a longer run, namely, 94 consecutive occurrences of the code length 0. This run is
encoded by the run indicator $9 = n + 3$ and the bit string $\mathrm{bin}_7(94 - 12) = 1010010$, where
$n = 6$ is the maximal length of a code from C3.
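
Sect. 5.1 is not reproduced here, so the following is only a sketch under the assumption that the reorganization is the usual canonical assignment: symbols are sorted by code length and then by symbol value, and receive consecutive code words. The function name is hypothetical; the code lengths are read off the Old code column of Table 6.2.

```python
def canonical_codes(code_lengths: dict[int, int]) -> dict[int, str]:
    """Assign canonical prefix codes from code lengths (length 0 means no code)."""
    codes = {}
    code = 0
    prev_len = 0
    used = [(length, sym) for sym, length in code_lengths.items() if length > 0]
    for length, sym in sorted(used):
        code <<= length - prev_len          # append zeros when the length grows
        codes[sym] = format(code, f"0{length}b")
        code += 1
        prev_len = length
    return codes


# Code lengths of C2 taken from the Old code column of Table 6.2.
c2_lengths = {1: 4, 2: 2, 3: 2, 4: 3, 5: 4, 6: 3, 7: 4, 8: 4}
c2_codes = canonical_codes(c2_lengths)
for symbol in sorted(c2_codes):
    print(symbol, c2_codes[symbol])
```

For these lengths the sketch assigns 00 and 01 to the symbols 2 and 3, 100 and 101 to the symbols 4 and 6, and 1100 to 1111 to the symbols 1, 5, 7 and 8. Since such an assignment is determined by the code lengths alone, storing the lengths is enough for the decompressor to rebuild the codes.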

The super Huffman coding listed in Table 6.4 is written to the output file (values 2-11) using
3 bits per integer as it is stated by the value 0 of the output file. Ten super code lengths need
to be enumerated since 10 values (the base code lengths 0, 1, ..., 6 and the run indicators
7, 8, 9 which are used by the base Huffman coding C3) need to be encoded.



6 Experimental Results 

In the following, we compare the compression performance of our implementation of the Re-pair 
for Trees algorithm with existing algorithms. Furthermore, we will check the impact of the DAG 
representation of the input tree on the compression factors achieved and we will learn about the 
influences of small changes to the maximal rank allowed for a nonterminal.

Fig. 42: Depiction of the part of the output file which contains the serialized four Huffman codings.

6.1 XML Documents Used 

The set of XML documents we used for investigating the performance of TreeRePair consists
of 23 files with different characteristics (cf. Table 6). Most of them were used in past papers
evaluating various XML compressors and therefore may be familiar to the reader. The original
files can be obtained from the sources listed in Table 7. In all cases character data, attributes,
comments and namespace information were removed from the XML files, i.e., the XML documents
consist only of start tags, end tags and empty element tags. We do so because, at this time,
TreeRePair ignores this information and concentrates solely on the XML document tree.

6.2 Algorithms Used in Comparison 

Basically, we compare our implementation of Re-pair for Trees with two other compression
algorithms based on linear SLCF tree grammars, namely, BPLEX [BLM08] and Extended-Repair
[Kri08,BHK10]. The former is a sliding-window based linear time approximation algorithm. It
searches bottom-up in a fixed window for repeating tree patterns. The size of the sliding
window, the maximal pattern size and the maximal rank of a nonterminal can be specified as input
parameters. One of the main drawbacks of BPLEX is that only a slow implementation of it exists.

Extended-Repair (which we sometimes call E-Repair in the sequel) is an algorithm developed
by a group from the University of Paderborn, Germany [Kri08,BHK10]. This algorithm is,
just like our Re-pair for Trees algorithm, based on the Re-pair algorithm introduced in [LM00].
However, it was independently developed and exhibits some fundamental differences to our
algorithm. One of the main differences is that the Extended-Repair algorithm at first generates a
DAG of the input tree and then processes each part of it individually, i.e., it generates multiple
grammars which are combined in the end. The individual parts of the input tree are called
"re-pair packets". The maximal size of each packet can be specified by an input parameter (default
is 20000 edges). The author of [Kri08] points out that this packet-based behavior may have a
negative impact on the compression performance of the Extended-Repair algorithm. Our own
investigations concerning a TreeRePair version running on the DAG of the input tree instead of
on the whole tree support this point of view.

XML document      File size (kb)      # Edges   Depth   # Element types   Source
1998statistics               349       28 305       5                46        1
catalog-01                 4 219      225 193       7                50        9
catalog-02                44 656    2 390 230       7                53        9
dictionary-01              1 737      277 071       7                24        9
dictionary-02             17 128    2 731 763       7                24        9
dblp                     117 822   10 802 123       5                35        2
EnWikiNew                  4 843      404 651       4                20        3
EnWikiQuote                3 134      262 954       4                20        3
EnWikiSource              13 457    1 133 534       4                20        3
EnWikiVersity              5 887      495 838       4                20        3
EnWikTionary              99 201    8 385 133       4                20        3
EXI-Array                  5 347      226 522       9                47        5
EXI-factbook               1 214       55 452       4               199        5
EXI-Invoice                  266       15 074       6                52        5
EXI-Telecomp               3 700      177 633       6                39        5
EXI-weblog                 1 104       93 434       2                12        5
JST_gene.chr1              4 202      216 400       6                26        8
JST_snp.chr1              13 795      655 945       7                42        8
medline02n0328            51 751    2 866 079       6                78        6
NCBI_gene.chr1             6 862      360 349       6                50        8
NCBI_snp.chr1             63 941    3 642 224       3                15        8
sprot39.dat              111 175   10 903 567       5                48        7
treebank                  19 551    2 447 726      36               251        4

Table 6: Characteristics of the XML documents used in our tests. The values in the "Source"-column match the source IDs in
Table 7. The depth of an XML document tree specifies the length (number of edges) of the longest path from the root
of the tree to a leaf.

In [Kri08] it is shown that Extended-Repair achieves a much better compression ratio on the
XML document NCBI_snp.chr1 when the input tree is not broken down into packets (this
can be achieved by choosing the maximum packet size large enough). However, our experiments
show that at the same time the memory requirements and the runtime of the Extended-Repair
algorithm rise drastically. Note that, regarding our algorithm, the DAG representation is merely
used to save memory resources and is almost completely transparent to the overlying digram
replacement process (cf. Sect. 4.5 on page 41).

6.3 Testing Environment 

Our experiments were done on a computer with an Intel(R) Core(TM) 2 Duo CPU T9400 processor,
four gigabytes of RAM and the Linux operating system. Every algorithm was executed on a
single processor core, i.e., no algorithm was able to make use of multiprocessing. TreeRePair
and BPLEX were compiled with the gcc-compiler using the -O3 (compile time optimizations)
and -m32 (i.e., we generated them as 32bit-applications) switches. We were not able to compile
the succ-tool of the BPLEX distribution with compile time optimizations (i.e., using the -O3
switch). This tool is used to apply a succinct coding to a grammar generated by the BPLEX
algorithm. However, this should not have a great influence on the runtime measured for BPLEX
since the succ-tool usually executes quite fast compared to the runtime of the actual BPLEX
algorithm. In contrast, Extended-Repair is an application written in Java(TM) for which we only
had the bytecode at hand, i.e., we did not have access to the source code of it. We executed
Extended-Repair using the Java SE Runtime Environment(TM) in version 1.6.0_15.

ID   Source
1    http://www.cafeconleche.org/examples
2    http://dblp.uni-trier.de/xml
3    http://download.wikipedia.org/backup-index.html
4    http://www.cs.washington.edu/research/xmldatasets
5    http://www.w3.org/XML/EXI
6    http://www.ncbi.nlm.nih.gov/pubmed
7    http://expasy.org/sprot
8    http://snp.ims.u-tokyo.ac.jp
9    http://softbase.uwaterloo.ca/~ddbms/projects/xbench

Table 7: Sources of the XML documents from Table 6.

                TreeRePair    BPLEX   E-Repair    mDAG   bin. mDAG
Edges (%)              2.9      3.4        4.1    12.8        18.3
#NTs                 4 753   13 660      6 522   2 075       5 320
Time (sec)              10      322         63       -           -
Mem (MB)                47      536        401       -           -
File size (%)         0.46     0.71       0.61       -           -

Table 8: Average values of the characteristics of the generated grammars and of the corresponding runs of the algorithms.

During the execution of the algorithms we always measured their memory usage. We ac- 
complished this by constantly polling the VmRSS-value which is printed out by executing the 
command cat /proc/<pid>/status, where <pid> is the process ID assigned to the al- 
gorithm. In the first second of the execution of an algorithm this value was checked every ten 
milliseconds and after that the frequency was slowly reduced to one second. 
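
The polling procedure can be sketched as follows. This is an illustration and not the measurement script that was used: the function names, the growth factor 1.5 of the polling interval and the decision to report the peak value are assumptions. The sketch reads /proc/<pid>/status directly instead of invoking cat.

```python
import time
from typing import Optional


def parse_vmrss_kb(status_text: str) -> Optional[int]:
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def next_interval(elapsed: float, current: float) -> float:
    """10 ms during the first second, then grow slowly up to 1 s (assumed schedule)."""
    if elapsed < 1.0:
        return 0.01
    return min(1.0, current * 1.5)


def poll_peak_rss_kb(pid: int) -> int:
    """Poll VmRSS until the process disappears and return the largest value seen."""
    peak, interval, start = 0, 0.01, time.monotonic()
    while True:
        try:
            with open(f"/proc/{pid}/status") as handle:
                rss = parse_vmrss_kb(handle.read())
        except (FileNotFoundError, ProcessLookupError):
            return peak                      # process has terminated
        if rss is not None:
            peak = max(peak, rss)
        time.sleep(interval)
        interval = next_interval(time.monotonic() - start, interval)
```

A polling approach of this kind can miss short memory peaks between two samples, which is why the sampling is dense at the start of a run.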

Every time we executed BPLEX we used its default input parameters, namely, window size: 
20000, maximal pattern size: 20, maximal rank: 10. In order to be able to test BPLEX together 
with every file of our set of test XML documents we needed to explicitly allow large stack sizes 
using the standard tool ulimit. 

6.4 Comparison of the Generated Grammars 

In this section, we compare the final grammars generated by the algorithms TreeRePair, BPLEX
and Extended-Repair. All algorithms were instructed to minimize the number of edges of the
generated grammar. For TreeRePair, we achieved this behavior by specifying the -optimize
edges input parameter. Regarding Extended-Repair, we used the supplied ConfEdges.xml
configuration file which is supposed to make Extended-Repair minimize the number of edges.
The BPLEX algorithm was executed with its default input parameter values and no changes were
made to the generated grammar (besides pruning nonterminals which are referenced only once
by using the supplied gprint tool).

                TreeRePair   BPLEX   E-Repair   XMill   gzip   bzip2
File size (%)         0.45    0.57       0.61    0.47   1.36    0.58
Time (sec)              10     329        167     119     <1      16
Mem (MB)                47     536        399       7              7
Edges (%)              3.0     3.9        4.1       -      -       -
#NTs                 2 642   2 796      7 003       -      -       -

Table 9: Average values of the characteristics of the runs of the three algorithms when making a small size of the output file top
priority.

Table 8 shows the average values of the essential characteristics of the final grammars generated
by the three competing algorithms. The first row shows the average compression factors in
terms of the number of edges in percent. The edge compression factor is computed as follows: if
$t \in T(\mathcal{F})$ is the binary representation of the input tree and $\mathcal{G}$ is the final grammar, we obtain the
edge compression factor by computing $|\mathcal{G}| / |t| \cdot 100$. The second row shows the average number of
nonterminals of the final grammars. For the sake of completeness, the average runtimes (in
seconds), the average memory usages (in megabytes) and the average file size compression factors
are also listed. The compression factor in terms of file size specifies the ratio between the size of
the input file and the file size of the succinct coding of the final grammar in percent.
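
As a small worked illustration of the edge compression factor and of the averaging over documents, consider the following sketch. The function names and the edge counts are hypothetical and do not stem from our test data.

```python
def edge_compression_factor(grammar_edges: int, tree_edges: int) -> float:
    """Edges of the final grammar relative to the edges of the binary tree, in percent."""
    if tree_edges <= 0:
        raise ValueError("the input tree must have at least one edge")
    return 100 * grammar_edges / tree_edges


def average(values: list[float]) -> float:
    """Plain arithmetic mean, as used for the per-column averages."""
    return sum(values) / len(values)


# Hypothetical (grammar edges, tree edges) pairs for three documents.
runs = [(290, 10_000), (150, 10_000), (460, 10_000)]
factors = [edge_compression_factor(g, t) for g, t in runs]
print(factors, average(factors))
```

Note that an average of per-document factors weights every document equally, regardless of its size.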

We also added two columns to Table 8 showing the average number of edges and the average 
number of nonterminals of the minimal DAGs of the input trees (mDAG) and the minimal DAGs 
of the binary representations of the input trees (bin. mDAG). 

As it can be seen, on average, TreeRePair generates the smallest linear SLCF tree grammars 
(in terms of the number of edges) compared to the other two algorithms. At the same time, its 
grammars exhibit a small number of nonterminals. It outperforms BPLEX and Extended-Repair 
in terms of runtime and memory usage. The speed and moderate requirements on main memory 
are a result of the transparent DAG representation of the input tree and the many optimizations 
we made to the source code of TreeRePair during our investigations. 

Figure 43.1 on page 63 gives an impression of how each of the three algorithms performs on
the individual XML documents in terms of the size of the final grammar in edges. For each file,
the algorithm which generates the largest grammar is set to 100%. In Appendix A.1 on page 66
there is a detailed table listing all relevant characteristics of the runs of the algorithms on the set
of test XML documents.

6.5 Comparison of Output File Sizes 

In this section, we concentrate on the sizes of the files generated by the runs of the algorithms
on our set of test XML documents. In fact, we execute each algorithm in a mode in which the
size of the resulting file is made a top priority. For TreeRePair, we achieve this by specifying the
input parameter -optimize filesize and for Extended-Repair, we get such a behavior by
using the supplied ConfSize.xml configuration file and the -s 4 switch. The latter chooses
a certain succinct coding of the Extended-Repair distribution which is supposed to generate very
small representations of the generated grammar. Regarding BPLEX, we first apply the supplied
gprint-tool using the parameters --prune and --threshold 14. After that we use the
succ-tool of the BPLEX distribution together with the parameter --type 68 to generate a
Huffman coding-based succinct coding of the corresponding grammar. In [MMS08] it is stated
that this approach leads to the best compression performance of BPLEX in general (in terms of
file size).

In addition to the above three algorithms, we also consider the compression results produced
by gzip, bzip2¹ and XMill 0.8 [LS00]. We include them in our comparison to give an
impression of common compression rates and runtimes. The first two algorithms are widely
used general purpose file compressors which, of course, produce a non-queryable compressed
representation of the input file. In contrast, XMill is a compressor specialized in compressing the
structure and, in particular, the character data of XML documents. In fact, it mainly concentrates
on how to group the character data of an XML document in such a way that it can be efficiently
compressed by general purpose compressors like gzip. Since its implementation does not exhibit
a special "only consider the structure of the XML document" mode, it may be unfair to directly
compare its compression results with those of TreeRePair, BPLEX or Extended-Repair.
However, we included its compression results, which we obtained using its default input parameters,
because we were interested in its performance in this setting.

Table 9 shows the average sizes of the output files generated by the six algorithms mentioned 
above. For the sake of completeness, the average runtime, the average memory usage, the average 
number of edges and the average number of nonterminals are also listed. Again, TreeRePair 
outperforms BPLEX and Extended-Repair regarding all considered characteristics. Surprisingly, 
its queryable output files are even smaller than the non-queryable ones produced by the highly 
optimized gzip and bzip2 algorithms. However, gzip (but interestingly not bzip2) runs much 
faster than TreeRePair on our test data. 

Figure 43.2 gives an impression of how each of the six algorithms performs on the individual
XML documents in terms of the size of the generated output file. For each file, the algorithm
which generates the biggest output file is set to 100%. In Appendix A.2 on page 70 there is a
detailed table listing all relevant characteristics of the runs of the algorithms on our set of test
XML documents.

6.6 Results without DAG Representation 

Table 10 shows a comparison between the compression results of TreeRePair with and without
the DAG representation described in Sect. 4.2 on page 33. The left column shows the values
obtained when executing TreeRePair with its default parameters in edge optimization mode,
i.e., we only use the -optimize edges switch since our algorithm uses the DAG representation
by default. In contrast, the right column is a result of running TreeRePair with the -no_dag and
-optimize edges switches. Again, in Appendix A.3 on page 75, there is a detailed table listing
all relevant characteristics of the runs of the two TreeRePair configurations on each test XML
document.

¹ For more information about the gzip algorithm, see http://www.gzip.org. For bzip2, see http://www.bzip.org.

                with DAG   without DAG
Edges (%)           2.86          2.84
#NTs               4 753         4 620
File size (%)      0.463         0.459
Time (sec)           9.8          11.2
Mem (MB)              47           188

Table 10: Average values of the characteristics of the runs of TreeRePair with and without the DAG representation of the input
tree.

Max. rank            0       1       2       3       4
Edges (%)        55.02    3.29    2.92    2.89    2.86    2.89    2.89
#NTs             1 265   5 539   4 712   4 916   4 753   4 956   4 958
File size (%)     2.12    0.51    0.47    0.47    0.46    0.47    0.46
Time (sec)         7.0     8.4     9.3     9.5     9.6     9.8     9.8
Mem (MB)            44      44      45      47      47      47      47

Table 11: Average values of the characteristics of the runs of TreeRePair with different maximal ranks allowed for a nonterminal.

Regarding the differences between the compression results of TreeRePair and the ones of the 
competing algorithms, it can be said that the DAG representation only has a minor impact on 
the compression performance of our algorithm. However, we can state that it drastically reduces 
the memory demands of TreeRePair — it slashes the memory consumption by a factor of 4. 
Interestingly, even without the DAG representation, TreeRePair uses only half as much main 
memory as Extended-Repair does (cf. Table 8). Furthermore, the DAG representation leads to a 
faster compression speed since it saves repetitive recalculations concerning equal subtrees. 

6.7 Results with Different Maximal Ranks 

We executed TreeRePair using the -optimize edges (i.e., we enabled the edge optimization
mode) and the -max_rank switches. Each time, we specified a different maximal rank for a
nonterminal in order to get information concerning its influence on the compression performance.
Table 11 shows that, regarding our set of test XML documents, a maximal rank of 4 leads to the
best compression results on average.

At the same time, we can see that even when restricting the maximal rank to 1 TreeRePair
performs better than BPLEX and Extended-Repair (cf. Table 8). The fact that large maximal
ranks can lead to a worse compression ratio can be explained by the trees from Sect. 3.7 on
page 24. Note that the trees from this section are basically long lists. Although this is not the
case for our test trees, their shape is nevertheless similar to a list structure. In any case, it is quite
distinct from the shape of a full binary tree, where an unlimited maximal rank leads to the best
compression ratio (cf. Sect. 3.6 on page 19).




References 

[BGK03] Peter Buneman, Martin Grohe, and Christoph Koch. Path queries on compressed XML. In VLDB 2003:
    Proceedings of the 29th international conference on very large data bases, pages 141-152. VLDB Endowment, 2003.
[BHK10] Stefan Böttcher, Rita Hartel, and Christoph Krislin. CluX: Clustering XML sub-trees. In ICEIS 2010:
    Proceedings of the 12th International Conference on Enterprise Information Systems, 2010.
[BLM08] Giorgio Busatto, Markus Lohrey, and Sebastian Maneth. Efficient memory representation of XML document
    trees. Information Systems, 33(4-5):456-474, 2008.
[BPSM+08] Tim Bray, Jean Paoli, C. Michael Sperberg-McQueen, Eve Maler, and François Yergeau. Extensible markup
    language (XML) 1.0. W3C recommendation, XML Core Working Group, World Wide Web Consortium, November 2008.
[CDG+07] H. Comon, M. Dauchet, R. Gilleron, C. Löding, F. Jacquemard, D. Lugiez, S. Tison, and M. Tommasi. Tree
    automata techniques and applications. http://www.grappa.univ-lille3.fr/tata, 2007.
[CLL+05] Moses Charikar, Eric Lehman, April Lehman, Ding Liu, Rina Panigrahy, Manoj Prabhakaran, Amit Sahai, and
    Abhi Shelat. The smallest grammar problem. IEEE Trans. Inform. Theory, 51(7):2554-2576, 2005.
[Deu96] P. Deutsch. DEFLATE compressed data format specification version 1.3. http://tools.ietf.org/html/rfc1951, 1996.
[FGK03] Markus Frick, Martin Grohe, and Christoph Koch. Query evaluation on compressed trees (extended abstract).
    In LICS '03: Proceedings of the 18th Annual IEEE Symposium on Logic in Computer Science, pages 188-197. IEEE
    Computer Society Press, 2003.
[Kri08] Christoph Krislin. Optimierung grammatik-basierter XML-Kompression. Diplomarbeit, Faculty for Electrical
    Engineering, Computer Science and Mathematics, University of Paderborn (Germany), 2008.
[LM00] N. Jesper Larsson and Alistair Moffat. Off-line dictionary-based compression. Proceedings of the IEEE,
    88(11):1722-1732, 2000.
[LM06] Markus Lohrey and Sebastian Maneth. The complexity of tree automata and XPath on grammar-compressed
    trees. Theoretical Computer Science, 363(2):196-210, 2006.
[LMSS09] Markus Lohrey, Sebastian Maneth, and Manfred Schmidt-Schauss. Parameter reduction in grammar-compressed
    trees. In Proceedings of FOSSACS 2009, number 5504 in Lecture Notes in Computer Science, pages 212-226.
    Springer, 2009.
[LS00] H. Liefke and D. Suciu. XMill: an efficient compressor for XML data. In Proceedings of the 2000 ACM SIGMOD
    International Conference on Management of Data, page 164. ACM Press, 2000.
[MLMK05] Makoto Murata, Dongwon Lee, Murali Mani, and Kohsuke Kawaguchi. Taxonomy of XML schema languages
    using formal language theory. ACM Transactions on Internet Technology, 5(4):660-704, 2005.
[MMS08] Sebastian Maneth, Nikolay Mihaylov, and Sherif Sakr. XML tree structure compression. International Workshop
    on Database and Expert Systems Applications, pages 243-247, 2008.
[MSV03] Tova Milo, Dan Suciu, and Victor Vianu. Typechecking for XML transformers. Journal of Computer and System
    Sciences, 66(1):66-97, 2003.
[Nev02] Frank Neven. Automata theory for XML researchers. SIGMOD Record, 31(3):39-46, 2002.
[WLH07] Fangju Wang, Jing Li, and Hooman Homayounfar. A space efficient XML DOM parser. Data & Knowledge
    Engineering, 60(1):185-207, 2007.






A Detailed Test Results 

A.l Optimization of Total Number of Edges 



Algorithm Edges File size #NTs Time Mem (MB) 







1998statistics 






TreeRePair 


1.68% 


0.20% 54 


100ms 


1 


BPLEX 


1.80% 


0.34% 168 


1.813s 


295 



E-Repair 1.69% 0.24% 37 7.518s 



bin. mDAG 8.49% 



mDAG 



4.87% 



31 
15 



114 







catalog 


-01 






TreeRePair 


1.69% 


0.10% 


400 


887ms 


2 


BPLEX 


2.22% 


0.22% 


1251 


6.548s 


315 


E-Repair 


1.63% 


0.12% 


291 


9.975s 


279 


bin. mDAG 


3.10% 


- 


520 


- 


- 


mDAG 


3.80% 


- 


506 


- 


- 






catalog 


-02 






TreeRePair 


1.11% 


0.07% 


965 


9.409s 


10 


BPLEX 


1.38% 


0.11% 


3045 


30s 


512 


E-Repair 


1.52% 


0.11% 


1499 


42s 


511 


bin. mDAG 


2.22% 


- 


805 


- 


- 


mDAG 


1.39% 


- 


792 


- 


- 






dblp 








TreeRePair 


3.89% 


0.59% 


25250 


43s 


227 


BPLEX 


4.27% 


0.73% 


38712 57m 42s 


1644 


E-Repair 


5.65% 


0.68% 


30430 


4m 34s 


510 


bin. mDAG 


19.36% 


- 


6592 


- 


- 


mDAG 


11.11% 


- 


3378 


- 


- 






dictionary-01 






TreeRePair 


7.72% 


1.54% 


1676 


1.010s 


9 


BPLEX 


8.43% 


2.37% 


3994 


44s 


323 


E-Repair 


8.71% 


1.83% 


1248 


16s 


433 


bin. mDAG 


27.99% 


- 


2058 


- 


- 


mDAG 


21.07% 


- 


448 


- 


- 
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Edges 


File size 


#NTs 


Time Mem (MB) 






dictionary -02 






TreeRePair 


5.92% 


1.38% 


9757 


Us 


69 


BPLEX 


6.58% 


1.95% 


23209 


6m 12s 


587 


E-Repair 


8.52% 


1.83% 


11672 


lm40s 


494 


bin. mDAG 


24.93% 


- 


16281 


- 


- 


mDAG 


19.96% 


- 


2414 


- 


- 






EnWikiNew 






TreeRePair 


2.29% 


0.21% 


667 


1.585s 


8 


BPLEX 


2.40% 


0.30% 


1369 


35s 


337 


E-Repair 


2.42% 


0.24% 


476 


12s 


347 


bin. mDAG 


17.31% 


- 


23 


- 


- 


mDAG 


8.67% 


- 


29 


- 


- 






EnWikiQuote 






TreeRePair 


2.42% 


0.21% 


452 


1.158s 


7 


BPLEX 


2.56% 


0.31% 


985 


25s 


321 


E-Repair 


2.58% 


0.26% 


323 


9.924s 


290 


bin. mDAG 


18.14% 


- 


19 


- 


- 


mDAG 


9.09% 


- 


25 


- 


- 






EnWikiSource 






TreeRePair 


1.10% 


0.10% 


861 


4.927s 


26 


BPLEX 


1.28% 


0.16% 


1895 


lm9s 


418 


E-Repair 


1.82% 


0.18% 


1106 


23s 


500 


bin. mDAG 


17.52% 


- 


19 


- 


- 


mDAG 


8.77% 


- 


24 


- 


- 






EnWikiVersity 






TreeRePair 


1.44% 


0.13% 


525 


2.107s 


12 


BPLEX 


1.53% 


0.18% 


1043 


34s 


347 


E-Repair 


1.61% 


0.15% 


423 


12s 


437 


bin. mDAG 


17.60% 


- 


19 


- 


- 


mDAG 


8.81% 


- 


24 


- 


- 






EnWikTionary 






TreeRePair 


0.97% 


0.11% 


4535 


36s 


183 


BPLEX 


1.09% 


0.14% 


6402 


8m 58s 


1287 


E-Repair 


1.48% 


0.15% 


6315 


lm33s 


540 


bin. mDAG 


17.32% 


- 


26 


- 


- 


mDAG 


8.66% 


- 


30 


- 


- 
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EXI-Array 






TreeRePair 


0.41% 


0.03% 


123 


1.281s 


14 


BPLEX 


0.65% 


0.06% 


383 


42s 


322 


E-Repair 


0.53% 


0.05% 


142 


8.017s 


320 


bin. mDAG 


56.51% 


- 


8 


- 


- 


mDAG 


42.20% 


- 


13 


- 


- 






EXI-factbook
TreeRePair   2.35%    0.31%        145      271ms     2
BPLEX        4.11%    0.77%        1423     5.138s    298
E-Repair     2.58%    0.31%        146      11s       408
bin. mDAG    9.16%    -            236      -         -
mDAG         8.07%    -            293      -         -

EXI-Invoice
TreeRePair   0.68%    0.21%        14       74ms      1
BPLEX        0.62%    0.30%        40       1.483s    293
E-Repair     0.93%    0.24%        20       4.689s    119
bin. mDAG    13.74%   -            6        -         -
mDAG         7.12%    -            15       -         -

EXI-Telecomp
TreeRePair   0.07%    0.01%        21       780ms     3
BPLEX        0.06%    0.02%        47       9.684s    310
E-Repair     0.08%    0.02%        21       11s       452
bin. mDAG    11.15%   -            10       -         -
mDAG         5.59%    -            15       -         -

EXI-weblog
TreeRePair   0.06%    0.01%        13       324ms     3
BPLEX        0.04%    0.01%        24       9.097s    303
E-Repair     0.05%    0.02%        11       7.868s    279
bin. mDAG    18.19%   -            2        -         -
mDAG         9.10%    -            2        -         -

JST_gene.chr1
TreeRePair   1.84%    0.10%        354      874ms     3
BPLEX        2.19%    0.19%        1113     11s       315
E-Repair     2.99%    0.17%        126      8.006s    233
bin. mDAG    6.75%    -            114      -         -
mDAG         4.24%    -            76       -         -

JST_snp.chr1
TreeRePair   1.51%    0.09%        856      3.150s    8
BPLEX        2.15%    0.21%        4193     31s       360
E-Repair     1.54%    0.10%        634      15s       445
bin. mDAG    6.20%    -            282      -         -
mDAG         3.59%    -            242      -         -

medline02n0328
TreeRePair   4.13%    0.35%        9064     16s       79
BPLEX        5.17%    0.62%        33976    5m 52s    574
E-Repair     6.73%    0.54%        13010    1m 32s    479
bin. mDAG    25.84%   -            20013    -         -
mDAG         22.80%   -            3960     -         -

NCBI_gene.chr1
TreeRePair   1.37%    0.09%        504      1.374s    4
BPLEX        2.38%    0.28%        3631     14s       327
E-Repair     1.68%    0.11%        328      10s       308
bin. mDAG    3.98%    -            605      -         -
mDAG         4.45%    -            436      -         -

NCBI_snp.chr1
TreeRePair   <0.01%   <0.01%       17       15s       80
BPLEX        <0.01%   <0.01%       23       2m 6s     770
E-Repair     0.03%    0.01%        291      37s       504
bin. mDAG    22.22%   -            2        -         -
mDAG         11.11%   -            2        -         -

sprot39.dat
TreeRePair   2.30%    0.38%        20224    43s       178
BPLEX        3.16%    0.79%        111167   14m 41s   1446
E-Repair     4.27%    0.59%        33102    3m 48s    499
bin. mDAG    13.18%   -            31116    -         -
mDAG         16.07%   -            10243    -         -

treebank
TreeRePair   20.72%   4.41%        32857    22s       164
BPLEX        23.29%   6.16%        76109    21m 27s   645
E-Repair     34.85%   6.03%        48358    6m 50s    526
bin. mDAG    59.42%   -            43586    -         -
mDAG         53.75%   -            24746    -         -

A.2 Optimization of File Size

Algorithm    Edges    File size    #NTs     Time      Mem (MB)






1998statistics
TreeRePair   1.77%    0.20%        35       109ms     1
BPLEX        2.19%    0.25%        27       2.018s    295
E-Repair     1.68%    0.24%        37       4.578s    108
bzip2        -        0.29%        -        229ms     4
gzip         -        0.81%        -        8ms       -
XMill        -        0.24%        -        2.728s    2

catalog-01
TreeRePair   1.76%    0.10%        279      898ms     2
BPLEX        2.23%    0.14%        342      6.834s    315
E-Repair     2.77%    0.19%        236      11s       349
bzip2        -        0.24%        -        2.701s    8
gzip         -        0.85%        -        51ms      -
XMill        -        0.11%        -        12s       2

catalog-02
TreeRePair   1.12%    0.07%        770      10s       10
BPLEX        1.27%    0.08%        948      32s       512
E-Repair     1.49%    0.12%        1692     47s       521
bzip2        -        0.23%        -        28s       8
gzip         -        0.81%        -        450ms     -
XMill        -        0.09%        -        1m 58s    12

dblp
TreeRePair   4.03%    0.58%        14533    43s       227
BPLEX        4.52%    0.65%        11693    61m 15s   1644
E-Repair     5.52%    0.68%        35125    42m 48s   516
bzip2        -        0.56%        -        1m 11s    8
gzip         -        1.30%        -        1.230s    -
XMill        -        0.53%        -        11m 36s   15

dictionary-01
TreeRePair   8.08%    1.41%        930      1.117s    9
BPLEX        9.67%    1.85%        1044     46s       323
E-Repair     8.51%    1.81%        1428     19s       462
bzip2        -        1.52%        -        1.313s    7
gzip         -        3.07%        -        39ms      -
XMill        -        1.49%        -        17s       2

dictionary-02
TreeRePair   6.15%    1.31%        5024     11s       69
BPLEX        7.56%    1.63%        5424     6m 12s    587
E-Repair     8.30%    1.81%        13698    1m 57s    475
bzip2        -        1.52%        -        15s       7
gzip         -        3.05%        -        279ms     -
XMill        -        1.49%        -        2m 41s    13

EnWikiNew
TreeRePair   2.38%    0.20%        390      1.721s    8
BPLEX        2.63%    0.23%        335      35s       337
E-Repair     2.42%    0.24%        476      12s       369
bzip2        -        0.26%        -        2.999s    8
gzip         -        0.90%        -        57ms      -
XMill        -        0.23%        -        23s       2

EnWikiQuote
TreeRePair   2.51%    0.20%        274      1.195s    7
BPLEX        2.81%    0.23%        236      25s       321
E-Repair     2.58%    0.26%        323      10s       268
bzip2        -        0.28%        -        2.013s    8
gzip         -        0.93%        -        36ms      -
XMill        -        0.24%        -        15s       2

EnWikiSource
TreeRePair   1.14%    0.10%        515      5.025s    26
BPLEX        1.40%    0.13%        535      1m 10s    418
E-Repair     1.82%    0.18%        1127     23s       488
bzip2        -        0.16%        -        8.742s    8
gzip         -        0.63%        -        131ms     -
XMill        -        0.12%        -        1m 4s     9

EnWikiVersity
TreeRePair   1.50%    0.12%        303      2.244s    12
BPLEX        1.70%    0.15%        287      36s       347
E-Repair     1.61%    0.15%        423      13s       415
bzip2        -        0.19%        -        3.698s    8
gzip         -        0.69%        -        59ms      -
XMill        -        0.15%        -        28s       2

EnWikTionary
TreeRePair   1.00%    0.11%        2575     37s       183
BPLEX        1.15%    0.13%        2062     9m 13s    1287
E-Repair     1.48%    0.15%        6314     1m 40s    526
bzip2        -        0.17%        -        57s       8
gzip         -        0.68%        -        938ms     -
XMill        -        0.13%        -        7m 25s    15

EXI-Array
TreeRePair   0.44%    0.03%        75       1.393s    14
BPLEX        0.77%    0.05%        124      43s       322
E-Repair     0.51%    0.05%        155      7.833s    312
bzip2        -        0.05%        -        3.250s    8
gzip         -        0.37%        -        67ms      -
XMill        -        0.03%        -        10s       6

EXI-factbook
TreeRePair   2.51%    0.31%        99       356ms     2
BPLEX        6.44%    0.58%        170      5.333s    298
E-Repair     2.59%    0.31%        151      12s       438
bzip2        -        0.78%        -        854ms     8
gzip         -        1.10%        -        17ms      -
XMill        -        0.29%        -        5.248s    1

EXI-Invoice
TreeRePair   0.72%    0.21%        11       147ms     2
BPLEX        0.78%    0.28%        8        1.406s    293
E-Repair     0.91%    0.24%        21       4.320s    113
bzip2        -        0.30%        -        191ms     3
gzip         -        0.64%        -        7ms       -
XMill        -        0.26%        -        1.256s    2

EXI-Telecomp
TreeRePair   0.08%    0.01%        12       829ms     3
BPLEX        0.07%    0.02%        15       9.548s    310
E-Repair     0.08%    0.02%        24       13s       450
bzip2        -        0.09%        -        2.363s    8
gzip         -        0.45%        -        36ms      -
XMill        -        0.02%        -        11s       2

EXI-weblog
TreeRePair   0.06%    0.01%        9        400ms     3
BPLEX        0.05%    0.01%        12       9.004s    303
E-Repair     0.05%    0.02%        12       7.942s    288
bzip2        -        0.06%        -        720ms     8
gzip         -        0.40%        -        14ms      -
XMill        -        0.02%        -        8.342s    2

JST_gene.chr1
TreeRePair   1.91%    0.10%        111      906ms     3
BPLEX        2.42%    0.13%        211      11s       315
E-Repair     2.99%    0.17%        128      9.947s    211
bzip2        -        0.14%        -        2.599s    8
gzip         -        0.67%        -        43ms      -
XMill        -        0.10%        -        14s       2

JST_snp.chr1
TreeRePair   1.58%    0.08%        537      3.213s    8
BPLEX        2.45%    0.14%        569      32s       360
E-Repair     1.51%    0.10%        673      15s       453
bzip2        -        0.18%        -        9.251s    8
gzip         -        0.79%        -        149ms     -
XMill        -        0.09%        -        40s       8

medline02n0328
TreeRePair   4.32%    0.34%        4923     16s       79
BPLEX        6.47%    0.46%        6717     5m 45s    574
E-Repair     6.71%    0.54%        13243    1m 38s    477
bzip2        -        0.49%        -        31s       7
gzip         -        1.26%        -        544ms     -
XMill        -        0.34%        -        2m 13s    13

NCBI_gene.chr1
TreeRePair   1.43%    0.09%        354      1.442s    4
BPLEX        3.00%    0.16%        464      14s       327
E-Repair     1.66%    0.11%        342      10s       265
bzip2        -        0.15%        -        4.110s    8
gzip         -        0.71%        -        65ms      -
XMill        -        0.08%        -        21s       8









NCBI_snp.chr1
TreeRePair   <0.01%   <0.01%       11       15s       80
BPLEX        <0.01%   <0.01%       15       2m 6s     770
E-Repair     0.03%    0.01%        292      33s       465
bzip2        -        0.03%        -        40s       8
gzip         -        0.39%        -        578ms     -
XMill        -        0.00%        -        3m 45s    14

sprot39.dat
TreeRePair   2.41%    0.31%        11699    43s       178
BPLEX        4.33%    0.53%        11783    13m 43s   1446
E-Repair     4.25%    0.59%        33700    3m 59s    497
bzip2        -        0.45%        -        1m 11s    8
gzip         -        1.20%        -        1.122s    -
XMill        -        0.36%        -        9m 52s    15

treebank
TreeRePair   21.59%   [illegible]  17186    22s       164
BPLEX        26.21%   5.37%        21302    21m 36s   646
E-Repair     34.53%   6.01%        51470    7m 44s    514
bzip2        -        5.26%        -        6.407s    7
gzip         -        9.65%        -        843ms     -
XMill        -        4.51%        -        1m 36s    12

A.3 Without Using DAG Representation

Algorithm    Edges    File size    #NTs     Time      Mem (MB)




1998statistics
Without DAG  1.62%    0.20%        53       121ms     4
With DAG     1.68%    0.20%        54       214ms     1

catalog-01
Without DAG  1.69%    0.10%        400      1.381s    20
With DAG     1.69%    0.10%        400      1.022s    3

catalog-02
Without DAG  1.11%    0.07%        967      15s       199
With DAG     1.11%    0.07%        965      9.584s    10

dblp
Without DAG  3.89%    0.59%        25039    55s       1015
With DAG     3.89%    0.59%        25250    44s       227

dictionary-01
Without DAG  7.63%    1.51%        1622     1.238s    25
With DAG     7.72%    1.54%        1676     1.044s    9

dictionary-02
Without DAG  5.88%    1.36%        9390     12s       238
With DAG     5.92%    1.38%        9757     11s       69

EnWikiNew
Without DAG  2.28%    0.21%        656      2.042s    37
With DAG     2.29%    0.21%        667      1.732s    8

EnWikiQuote
Without DAG  2.41%    0.21%        458      1.320s    24
With DAG     2.42%    0.21%        452      1.223s    7

EnWikiSource
Without DAG  1.09%    0.10%        863      5.652s    101
With DAG     1.10%    0.10%        861      5.087s    26

EnWikiVersity
Without DAG  1.43%    0.13%        522      2.472s    45
With DAG     1.44%    0.13%        525      2.229s    12

EnWikTionary
Without DAG  0.97%    0.11%        4539     42s       743
With DAG     0.97%    0.11%        4535     38s       183

EXI-Array
Without DAG  0.40%    0.03%        122      1.378s    21
With DAG     0.41%    0.03%        123      1.394s    14

EXI-factbook
Without DAG  2.34%    0.31%        144      331ms     6
With DAG     2.35%    0.31%        145      330ms     2

EXI-Invoice
Without DAG  0.61%    0.21%        12       85ms      3
With DAG     0.68%    0.21%        14       124ms     1

EXI-Telecomp
Without DAG  0.06%    0.01%        17       1.132s    17
With DAG     0.07%    0.01%        21       850ms     3

EXI-weblog
Without DAG  0.05%    0.01%        10       607ms     10
With DAG     0.06%    0.01%        13       400ms     3

JST_gene.chr1
Without DAG  1.73%    0.09%        299      1.365s    21
With DAG     1.84%    0.10%        354      910ms     3

JST_snp.chr1
Without DAG  1.50%    0.09%        841      4.187s    59
With DAG     1.51%    0.09%        856      3.287s    8

medline02n0328
Without DAG  4.11%    0.34%        8524     17s       235
With DAG     4.13%    0.35%        9064     17s       79

NCBI_gene.chr1
Without DAG  1.37%    0.09%        486      1.959s    32
With DAG     1.37%    0.09%        504      1.498s    4

NCBI_snp.chr1
Without DAG  <0.01%   <0.01%       13       18s       337
With DAG     <0.01%   <0.01%       17       15s       80

sprot39.dat
Without DAG  2.31%    0.37%        18516    55s       936
With DAG     2.30%    0.38%        20224    44s       178

treebank
Without DAG  20.71%   4.41%        32786    14s       215
With DAG     20.72%   4.41%        32857    22s       164